https://www.emerald.com/insight/2059-5816.htm
The decay and persistence of web references
Fayaz Ahmad Loan and Ufaira Yaseen Shah
Centre of Central Asian Studies, University of Kashmir, Srinagar, India
Received 28 February 2020
Revised 7 April 2020
Accepted 16 April 2020
Abstract
Purpose – The purpose of this study is to identify the persistence and decay of uniform resource locators
(URLs) associated with Web references. The decaying of Web references is analyzed in relation to their age,
domain, technical errors and error codes.
Design/methodology/approach – The Web references of the Journal of Informetrics were selected for
analysis and interpretation to fulfill the set objectives. The references of all the scholarly articles, excluding
editorials and reviews published in the Journal of Informetrics for five years from 2007 to 2011 were recorded
in a text file. Later, the URLs were extracted from the articles to verify their accessibility in terms of
persistence and decay. The collected data were then transferred into an Excel file and tabulated for further
analysis and interpretation using simple statistical techniques.
Findings – The results showed that of the total 7,409 citations retrieved from 221 articles, 358 citations
(4.8%) were Web citations. These Web citations were assessed to find their persistence and decay. The results
reveal that 115 (32.12%) Web references were missing or dead. The most common error associated with the
missing Web citations was Error 404 Page not found, contributing 60% of the total missing citations,
followed by 400 Bad Request Error (35.65%). The domain analysis of missing Web citations depicts that most
of the missing URLs were associated with the .gov domain (40%), followed by .edu (29.58%) and .com
(26.04%).
Research limitations/implications – The Web references of a single journal, namely, Journal of
Informetrics, were analyzed over five years; hence, the findings should be generalized with
caution.
Practical implications – URL decay is becoming a major problem in the preservation and citation of
Web resources, and collaborative efforts are needed to reduce the decaying of URLs.
Originality/value – A good number of studies have been conducted to analyze the persistence and decay
of Web references, as it is a hot topic of research across disciplines, and this study is a step further in the
same direction.
Keywords Active links, Dead links, Error codes, URL decay, URL persistence, Web references,
URL domains, Missing references
Paper type Research paper
Digital Library Perspectives, Vol. 36 No. 2, 2020, pp. 157-166. © Emerald Publishing Limited, 2059-5816. DOI 10.1108/DLP-02-2020-0013
Introduction
The World Wide Web (WWW) is the first choice of information seekers to search for
information, surpassing other sources including libraries. Novice and expert users alike
first dip into the ocean of online information available on the WWW before turning to
other media. The Web is becoming the first choice among scholars to find information
on current research and breaking scientific discoveries and to keep up with colleagues at
other institutions (Zhao and Logan, 2002). However, when citing resources from the Web,
information seekers suffer from the disadvantage of instability. Unlike print media, the
URLs of Web resources may suddenly disappear or decay and may not be available
after a certain period. Wren et al. (2006), while investigating the reasons for URL decay,
stated that it is impossible to control this process by the original website creators. The
authors further stated that the best place for intervention to mitigate URL decay
would be at the time of publication. Thus, to maintain the reliability of the cited URLs
in scientific publishing, it is necessary to continuously monitor their growth and decay
rates. The present study, therefore, attempts to study the growth and decay rates of
Web citations in one of the prominent journals in the field of library and information
sciences (LIS), i.e. Journal of Informetrics (JOI). JOI, published by Elsevier, focuses on
quantitative aspects of information science such as Bibliometrics, Scientometrics,
Webometrics, Cybermetrics, Altmetrics and Informetrics (Elsevier, 2018).
Literature review
With the advent of the Web, the use of Web sources in scholarly publishing has
increased. Scholars make use of more and more Web resources in their research work.
Maharana et al. (2006) argue that the increasing number of citations to Web sources in
research papers is evidence that academic and research communities are increasingly
predisposed to use electronic resources for the advancement of scholarly communication.
But the non-availability and decaying of Web resources over time have prompted many
scholars to conduct research studies on the persistence of the Web resources. Several studies
have been carried out to assess the nature of Web resources associated with the scholarly
literature. In one of the earlier studies, Koehler (1999) examined 350 URLs for over
three years and concluded that 17.7% of Web sites and 31.8% of Web pages failed to
respond when searched for after 12 months. Casserly and Bird (2003) studied randomly
chosen 500 internet citations from scholarly articles in the field of LIS and concluded that
only 56% of the Web citations were permanent whereas the remaining 44% were missing
from their Web address. Spinellis (2003) investigated several thousand Web-located
references cited in the information systems-based literature and found an average of 1.71
Web citations per article; however, many of these Web-located citations had disappeared
and the average half-life for URLs in the articles examined was approximately four years.
Wren (2004) examined the stability and persistence of URLs published in MEDLINE with
special reference to 404 error. Of 1,630 unique URLs identified, formatting and/or spelling
errors were detected within 201 (12%) of them as published. After corrections were made,
63% of these URLs were consistently available and another 19% were available
intermittently. Goh and Ng (2007) studied the accessibility and decay of Web citations in three LIS
journals from 1997 to 2003. The study found that approximately 31% of all citations
were not accessible during the time of testing and the majority of errors were because of
missing content (HTTP Error Code 404). Wren (2008) examined the URL decay in
MEDLINE and found that the most common types of lost content were computer programs
(43%), followed by scholarly content (38%) and databases (19%) and all types of domains
were found missing but URLs published by organizations (.org) tend to be more stable.
Similar results were confirmed by Wagner et al. (2009), who studied the accessibility and
decay of the health-care management journals during 2002–2004. They further concluded
that the .edu domain was the most stable domain with the highest percentage of stability.
Bhat (2009) analyzed the Web references in five leading journals in the field of Library and
Information Science for over five years from 1998 to 2002. The results revealed that the
growth in Web citations increased from 41.6% in 1998 to 53.32% in 2002, whereas almost
32% of them were found to be missing, error 404 being the leading error associated with
them contributing to about 74% of the total missing Web citations. Tajeddini et al. (2011)
explored the availability and/or decay of URLs cited in articles of six LIS journals. The
research findings indicated that from 4,562 cited URLs, 34% had error messages mostly
related to “File error” type. Sampath et al. (2012) investigated the availability of Web
citations and their persistence in Indian LIS literature. The results revealed that 45.61% of
citations were not accessible during the time of testing and the majority of Web citations
showed HTTP Error Code 404 (63.84%) and .org domain was found to have the highest
failure rates (30.29%). Saberi and Abedi (2012) conducted a survey of accessibility and
decay of Web citations in five open-access Institute for Scientific Information (ISI) journals.
Their findings revealed that initially, only 73% of the URLs were accessible but after using
complementary pathways, the percentage of accessible URLs was increased to 89%. The
.net domain was the most stable domain, which had the highest persistence of 96%. Sife and
Bernard (2013) examined the persistence and decay of Web citations in theses and
dissertations and found that out of 15,468 total citations 1,487 (9.6%) were Web citations in
which 862 (58%) were inaccessible. In yet another study, Gul et al. (2014)
studied the Web citation growth and decay in one of the eminent journals in the field of LIS,
Ariadne, over the period 2010–2012. The authors concluded that the early published papers
have comparatively higher missing citations than the ones published later. More than half of
the missing Web citations encountered “404 error,” followed by error 500. However, the
“.com/.co” domain was found to be the most stable domain with 95% accessibility. Kumar
et al. (2015) studied the availability and persistence of URL citations cited in two journals
published by Emerald, Program and The Electronic Library, during 2008–2012.
Their study revealed that a total of 2,477 URLs (23.81%) were cited in 406 research
articles, containing a total of 10,400 citations, out of which 1,275 URLs (51.47%) were
accessible, whereas 1,202 URLs (48.53%) were inaccessible, 500 error being the prominent
error code associated with the missing Web citations. Kumar and Kumar (2017) investigated
the accessibility of URLs of citations in the articles in the DESIDOC Journal of Library and
Information Technology during 2006–2015. A total of 2,133 URL citations were identified out
of which 823 (38.58%) were not accessible and HTTP-404 was the most common error
associated with the missing URLs (643, 78.13%). Shah et al. (2018) conducted a Web
citation analysis of one of the prominent journals, Library and Information Science
Research, for ten years from 2004 to 2013. The results of their study revealed that the
Web citations showed positive growth from 11.61% in 2004 to 25.72% in 2013 among
13,468 references in 293 articles. The Web citations were mostly from organizational
websites, contributing about 42% of the total Web citations. Tajedini et al. (2018)
investigated the currency, disappearance and half-life of 1,127 URLs of Web
resources cited by Iranian researchers and found, based on the percentage of inaccessible
internet addresses, that .org and .com domains were more stable and
persistent than .net, .edu and others. Parmer and Pateria (2019) conducted a study to
identify the decay and durability of Web citations by analyzing citations of articles in
the Indian Journal of Agricultural Library and Information Services published during
2012–2016. A total of 980 citations were reported in 94 articles, of which 33.16%
were Web citations. Of the Web citations, 62.15% of URLs were accessible at the
time of testing and the remaining 37.85% of URLs were not accessible. HTTP error
message 404 “page not found” was the most frequent error message and
represented 51.22% of all HTTP error messages. Bansal and Parmar (2020) analyzed
the accessibility and deterioration of URLs of Web documents cited in the Current
Science journal published during 2015–2016. A total of 1,724 URLs cited in the 1,564
articles were examined. It was found that 56.67% of URLs were accessible and the
remaining 43.33% of URLs were not accessible mostly due to HTTP error messages,
HTTP 404 – “file not found” (59.03%). The literature review pointed out that the
problem of persistence and decaying of Web references has been continuously under
investigation, especially since the first decade of the 21st century, and the present study
is a step forward.
Research design
Objectives
This research has been conducted to study the status of Web citations in one of the
prominent journals in the field of LIS, Journal of Informetrics. The specific research
objectives are:
determining the growth rate and the average number of Web citations per article;
examining the diversity of top-level domains associated with Web citations;
studying the persistence and decay of URLs associated with Web citations;
examining the various types of errors and error codes associated with missing Web citations; and
examining the domains associated with missing Web citations.
Methodology
To fulfill the set objectives, the references of all the scholarly articles, excluding editorials
and reviews, published in the Journal of Informetrics over the five years from 2007 to 2011 were
recorded in a text file. Later, the URLs were extracted from the articles to verify their
accessibility in terms of persistence and decay. The collected data were then transferred into
an Excel file and tabulated for further analysis and interpretation using simple statistical
techniques.
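The extraction and verification steps described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual procedure: the regular expression, the function names (`extract_urls`, `check_url`, `classify`), the 10-second timeout and the rule that only 2xx/3xx responses count as active are all hypothetical choices.

```python
import http.client
import re
import urllib.error
import urllib.request
from typing import List, Optional

# Assumed pattern: anything starting with http(s):// or www. up to whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def extract_urls(reference: str) -> List[str]:
    """Pull URLs out of one reference string and normalise bare www. hosts."""
    urls = []
    for match in URL_PATTERN.findall(reference):
        url = match.rstrip(".,;:)")  # drop trailing sentence punctuation
        if url.lower().startswith("www."):
            url = "http://" + url
        urls.append(url)
    return urls


def check_url(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code, or None on a network-level failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 or 400, the error codes tabulated in the study
    except (OSError, ValueError, http.client.HTTPException):
        return None  # DNS failure, timeout, refused connection, malformed URL


def classify(status: Optional[int]) -> str:
    """Assumed rule: 2xx/3xx responses are active, everything else is missing."""
    if status is not None and 200 <= status < 400:
        return "active"
    return "missing"
```

A run over the recorded text file would call `extract_urls` on each reference and then `classify(check_url(url))` on each result. Some servers reject `HEAD` requests, so a fuller check might retry with `GET` before declaring a link missing.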
Data analysis
A total of 7,409 references were obtained from 221 articles published in the Journal
of Informetrics. These references were analyzed and interpreted to reveal results.
[Table columns: Year, Total articles, Total citations, Average citation/paper, Offline citations (%), Web citations (%)]
While analyzing the top-level domains associated with the Web citations, it was found that
most of the Web citations were from the .org domain (101, 28.21%), followed by .com/.co (96,
26.8%). The citations from .edu/.ac were 71 in number (19.8%), whereas .gov and .net
contributed a small percentage of 5.6 and 1.7, respectively (Table 3).
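A domain tally of this kind could be produced along the following lines. The sketch assumes a grouping rule that the paper does not spell out: a URL is assigned to a group by looking at the last two labels of its host name, so that country-code hosts ending in `.co.in` or `.edu.my` fall under .com/.co and .edu/.ac. The names `DOMAIN_GROUPS`, `domain_group` and `tally_domains` are hypothetical.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlparse

# Group labels follow the categories named in the text; the mapping is assumed.
DOMAIN_GROUPS = {
    "org": ".org",
    "com": ".com/.co",
    "co": ".com/.co",
    "edu": ".edu/.ac",
    "ac": ".edu/.ac",
    "gov": ".gov",
    "net": ".net",
}


def domain_group(url: str) -> str:
    """Map a URL to a domain group using the last two labels of its host."""
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the host
    host = (urlparse(url).hostname or "").lower()
    # Check the rightmost label first, then the one before it.
    for label in reversed(host.split(".")[-2:]):
        if label in DOMAIN_GROUPS:
            return DOMAIN_GROUPS[label]
    return "others"


def tally_domains(urls: Iterable[str]) -> Counter:
    """Count how many URLs fall into each domain group."""
    return Counter(domain_group(url) for url in urls)
```

Dividing each count by the total number of Web citations gives percentages of the kind reported above; the heuristic can misclassify unusual host names and is only meant as a starting point.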
Table 2. Growth rate of Web citations in research articles

Year    Total citations   Web citations   Cumulative Web citation score   RGR    Doubling time
2007    749               48              48                              –      –
2008    1,495             64              112                             0.84   0.82
2009    1,152             67              179                             0.47   1.47
2010    2,031             95              274                             0.43   1.61
2011    1,982             84              358                             0.27   2.56
Total   7,409             358
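The RGR and doubling-time columns of Table 2 can be reproduced approximately from the yearly Web citation counts. The paper does not state its formulas, so the sketch below assumes the definitions conventional in bibliometrics: RGR is the difference of the natural logarithms of successive cumulative counts over a one-year interval, and doubling time is ln 2 (about 0.693) divided by RGR. The function name `growth_table` is hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Row = Tuple[int, int, Optional[float], Optional[float]]


def growth_table(years: Sequence[int], yearly_counts: Sequence[int]) -> List[Row]:
    """Return (year, cumulative count, RGR, doubling time) for each year."""
    rows: List[Row] = []
    cumulative = 0
    previous: Optional[int] = None
    for year, count in zip(years, yearly_counts):
        cumulative += count
        if previous is None:
            rgr = doubling = None  # no earlier year to compare against
        else:
            # Assumed definition: growth of the log cumulative count per year.
            rgr = math.log(cumulative) - math.log(previous)
            doubling = math.log(2) / rgr if rgr > 0 else math.inf
        rows.append((year, cumulative, rgr, doubling))
        previous = cumulative
    return rows


if __name__ == "__main__":
    table = growth_table([2007, 2008, 2009, 2010, 2011], [48, 64, 67, 95, 84])
    for year, cumulative, rgr, doubling in table:
        print(year, cumulative, rgr, doubling)
```

Under these assumptions the computed RGR values agree with the table to within rounding. The doubling times differ slightly in the second decimal, which suggests the published figures were derived from already rounded RGR values.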
[Table columns: Year, Total Web citations, Active citations (%), Missing citations (%)]
Conclusion
The present study indicated that the long-term persistence of Web sources cannot be
guaranteed. To reduce the problems of URL decay, efforts should be made at all levels.
Researchers and scholars need to take care while typing URLs, whereas
editors need to check the URLs carefully before publishing. Preference should be given
to the use of DOIs instead of URLs. Further, there is a need to archive Web sources at
more than one place so that online information can be preserved over the long term
for the benefit of future netizens. The national digital libraries of the world should take the lead in
this direction. These libraries can archive open access and copyright-free Web sources
for long-term persistence and thereby help avoid the decay of online documents.
References
Bansal, S. and Parmar, S. (2020), “Decay of URLs citation: a case study of current science”, Library
Philosophy and Practice, 3582, available at: https://digitalcommons.unl.edu/cgi/viewcontent.cgi?
article=6562&context=libphilprac
Bhat, M.H. (2009), “Missing web references-a case study of five scholarly journals”, Liber Quarterly,
Vol. 19 No. 2, pp. 131-139, available at: www.liberquarterly.eu/articles/10.18352/lq.7957/
Casserly, M. and Bird, J.E. (2003), “Web citation availability: analysis and implications for citations”,
American Communication Journal, Vol. 9 No. 2, available at:
http://crl.acrl.org/content/64/4/300.full.pdf+html
Elsevier (2018), Journal of INFORMETRICS, available at: https://www.journals.elsevier.com/journal-
of-informetrics
Goh, D.H. and Ng, P.K. (2007), “Link decay in leading information science journals”, Journal of the
American Society for Information Science and Technology, Vol. 58 No. 1, pp. 15-24, doi: 10.1002/
asi.20513.
Gul, S., Mahajan, I. and Ali, A. (2014), “The growth and decay of URL’s citation: a case of an online
library and information science journal”, Malaysian Journal of Library and Information Science,
Vol. 19 No. 3, pp. 27-39, available at: https://ajba.um.edu.my/index.php/MJLIS/article/view/1781
Hester, E.J., Heilig, L.F., Drake, A.L., Johnson, K.R., Vu, C.T., Schilling, L.M. and Dellavalle, R.P. (2004),
“Internet citations in oncology journals: a vanishing resource”, JNCI Journal of the National
Cancer Institute, Vol. 96 No. 12, pp. 969-971.
Koehler, W. (1999), “An analysis of web page and web site constancy and permanence”, Journal of the
American Society for Information Science, Vol. 50 No. 2, pp. 162-180.
Kumar, D.V. and Kumar, B.T.S. (2017), “Finding the unfound: recovery of missing URLs through
internet archive”, Annals of Library and Information Studies, Vol. 64, pp. 165-171.
Kumar, D.V., Kumar, B.T.S. and Parameshwarappa, D.R. (2015), “URL’s link rot: implications for
electronic publishing”, World Digital Libraries, Vol. 8 No. 1, pp. 59-66, doi: 10.18329/09757597/
2015/8105.
Maharana, B., Nayak, K. and Sahu, N. (2006), “Scholarly use of web resources in LIS research: a citation
analysis”, Library Review, Vol. 55 No. 9, pp. 598-607, doi: 10.1108/00242530610706789.
Moghaddam, A.I. and Saberi, M.K. (2011), “The life and death of URLs: the case of journal of the
medical library association”, Library Philosophy and Practice, available at: http://
digitalcommons.unl.edu/libphilprac/592
Parmer, S. and Pateria, R.K. (2019), “Web citations and decay of URLs: a case study of Indian journal of
agricultural library and information services”, Library Philosophy and Practice (e-Journal),
Vol. 3595, available at: https://digitalcommons.unl.edu/libphilprac/3595
Saberi, M.K. and Abedi, H. (2012), “Accessibility and decay of web citations in five open access ISI
journals”, Internet Research, Vol. 22 No. 2, pp. 234-247, doi: 10.1108/10662241211214.
Sampath, B.T., Kumar, K.R. and Raj, P. (2012), “Availability and persistence of web citations in Indian
LIS literature”, The Electronic Library, Vol. 30 No. 1, pp. 19-32, doi: 10.1108/02640471211204042.
Shah, U.Y., Khan, M.I. and Anayat, S. (2018), “Web referencing in online scholarly world: a case study
of library and information science research”, International Journal of Information Movement,
Vol. 2 No. 9, pp. 104-112.
Sife, A.S. and Bernard, R. (2013), “Persistence and decay of web citations used in theses and
dissertations available at the Sokoine national agricultural library”, Tanzania. International
Journal of Education and Development Using Information and Communication Technology,
Vol. 9 No. 2, pp. 85-94.
Spinellis, D. (2003), “The decay and failures of web references”, Communications of the ACM, Vol. 46
No. 1, pp. 71-77.
Tajeddini, O., Azimi, A., Sadatmoosavi, A. and Moghaddam, H.S. (2011), “Death of web citations: a
serious alarm for authors”, Malaysian Journal of Library and Information Science, Vol. 16 No. 3,
pp. 17-29.
Tajedini, O., Sadatmoosavi, A., Ghazizade, A. and Tajedini, A. (2018), “Investigation of the
currency, disappearance and half-life of URLs of web resources cited in Iranian researchers: a
comparative study”, International Journal of Information Science and Management, Vol. 16
No. 1, pp. 27-47.
Wagner, C., Gebremichael, M.D., Taylor, M.K. and Soltys, M.J. (2009), “Disappearing act: decay of
uniform resource locators in health care management journals”, Journal of the Medical Library
Association: JMLA, Vol. 97 No. 2, pp. 122-130, doi: 10.3163/5050.97.2.009.
Wren, J.D. (2004), “404 Not found: the stability and persistence of URLs published in MEDLINE”,
Bioinformatics, Vol. 20 No. 5, pp. 668-672, doi: 10.1093/bioinformatics/btg465.
Wren, J.D. (2008), “URL decay in MEDLINE – a 4-year follow-up study”, Bioinformatics, Vol. 24 No. 11,
pp. 1381-1385, doi: 10.1093/bioinformatics/btn127.
Wren, J.D., Johnson, K.R., Crockett, D.M., Heilig, L.F., Schilling, L.M. and Dellavalle, R.P. (2006),
“Uniform resource locator decay in dermatology journals”, Archives of Dermatology, Vol. 142
No. 9, pp. 1147-1152, available at: www.ncbi.nlm.nih.gov/m/pubmed/16983002/
Zhao, D. and Logan, E. (2002), “Citation analysis of scientific publications on the web: a case study in
XML research area”, Scientometrics, Vol. 54 No. 3, pp. 449-472, doi: 10.1023/A:1016090601710.
Further reading
Sellitto, C. (2004), “A study of missing web-cites in scholarly articles: towards an evaluation
framework”, Journal of Information Science, Vol. 30 No. 6, pp. 484-495, doi: 10.1177/
0165551504047822.
Corresponding author
Fayaz Ahmad Loan can be contacted at: fayazlib@yahoo.co.in