Open access and sources of full-text articles in Google Scholar in different subject fields

Article in Scientometrics · July 2015

DOI: 10.1007/s11192-015-1642-2

This is the post-print (after peer review) of the following paper, which is in press in Scientometrics.
Please cite as: Jamali, H. R., and Nabavi, M. (in press). Open access and sources of full-text articles
in Google Scholar in different subject fields, Scientometrics, DOI: 10.1007/s11192-015-1642-2
(http://dx.doi.org/10.1007/s11192-015-1642-2)

The supplementary data of this article is also freely available on ResearchGate:
https://www.researchgate.net/publication/280384298_List_of_queries

Open access and sources of full-text articles in Google Scholar in different subject fields

Hamid R. Jamali

Corresponding Author
Associate Professor
Address: Department of Library and Information Studies, Faculty of Psychology and
Education, Kharazmi University, No. 49, P.O. Box: 15614, Tehran, Iran.
Email: h.jamali@gmail.com
Phone: 00989127248119

Majid Nabavi

PhD student
Address: Iranian Research Institute for Information Science and Technology,
No. 1090, Felestin Intersection Enghelab Ave, P.O.Box:13185-1371, Tehran, Iran.
Email: m.nabavi@students.irandoc.ac.ir

Abstract
Google Scholar, a widely used academic search engine, plays a major role in finding free full-text
versions of articles. But little is known about the sources of full-text files in Google Scholar. The aim
of the study was to find out about the sources of full-text items and to look at subject differences in
terms of number of versions, times cited, rate of open access availability and sources of full-text
files. Three queries were created for each of 277 minor subject categories of Scopus. The queries
were searched in Google Scholar and the first ten hits for each query were analyzed. Citations and
patents were excluded from the results and the time frame was limited to 2004-2014. Results
showed that 61.1% of articles were accessible in full-text in Google Scholar; 80.8% of full-text articles
were publisher versions and 69.2% of full-text articles were PDF. There was a significant difference
between the means of times cited of full-text items and non-full-text items. The highest rate of
full-text availability for articles belonged to life sciences (66.9%). Publishers’ websites were the main
source of bibliographic information for non-full-text articles. For full-text articles, educational (edu,
ac.xx, etc.) and org domains were the top two sources of full-text files. ResearchGate was the top single
website providing full-text files (10.5% of full-text articles).

Keywords: Google Scholar, Open access, full text version, citation version, citation number,
Scopus subject categories

Introduction
Google Scholar (GS), an academic search engine launched in 2004, is widely used by students
(Cothran, 2011) and academics (Lercher, 2008; Ollé and Borrego, 2010). Its size is estimated at
about 160 million documents (Orduña-Malea et al., 2014). Besides its comprehensive coverage (de
Winter, Zadpoor and Dodou, 2014; Harzing, 2014), one reason for its use is that it helps users find
the full text of items (Lercher, 2008). Studies also show that, in comparisons with the Internet and
library databases, GS search results are as good as or better than those of library databases (Haya, Nygren and
Widmark, 2007; Howland et al., 2009). GS lists all versions of an item it finds in different locations
including repositories, academic and publisher websites. It also provides the URL of the full-text of
items if found anywhere, whether in pre-print format or in its final published format. Therefore, GS
not only provides links to Gold Open Access (OA) articles (articles published in OA by journals), but
also provides links to green OA articles (those whose pre-prints are deposited in repositories).
Moreover, it provides links to full-text items that authors may have posted illegally, in breach of their
copyright agreements with publishers, on a site such as ResearchGate. Therefore, given the
coverage of GS (which includes repositories, the main vehicle of the green road to open access) and
its full-text feature, it is a useful tool for finding out about the current state or evolution of OA and for
estimating the rate of full-text availability of journal articles. Past studies (Björk et al. 2010; Gargouri
et al. 2012) have shown variability in the rate of OA and full-text availability among different subject
fields. The most recent estimation (Archambault et al., 2013) shows that 48% of 2008 articles were
available as OA in 2012. However, the state of knowledge about OA growth and full-text availability
is far from perfect and more research is needed. This study aims to contribute in this area.

Another aspect of this study is the lack of knowledge about Google Scholar itself, which is due to the
general secrecy of Google products. GS does not list the specific resources it covers (Hartman &
Mullen, 2008). We do not know about the sources of full-text items in different subject fields and
whether they are preprint or final publisher version. Therefore, in addition to the estimation of the
amount of availability of full-text content per discipline, this study helps gain a better understanding
of Google Scholar as a popular one-stop-shop for academic content. More specifically, the objectives
of the current research are:

1. To determine the rate of full-text availability of items in GS according to disciplines;
2. To identify the sources of bibliographic information and full-text files in GS;
3. To determine any possible correlation between full-text availability, number of versions
and number of citations.

Literature review
There are some studies on the estimation of OA and a rich body of evidence to support citation
advantage of OA articles. GS plays a significant role in making full-text items findable and accessible
and this in turn might impact on the citation advantage. GS (and Google) has also played an
important role in studies on the estimation of OA such as Björk et al. (2010), Archambault et al.
(2013), Khabsa & Giles (2014), and Martin-Martin et al. (2014). However, only a few studies have
focused on GS itself and past studies tell little about the sources of full-text files presented in the
results of GS. In this literature review, first studies on open access are reviewed and then studies on
Google Scholar are presented to cover the two aspects of the current study.

Open access
There are studies on the promotion of open access in different countries or different subject fields
(e.g. Charles & Booth, 2011; Miguel et al., 2013), however, there are fewer on the estimation of the
rate of open access availability of journal articles in different subject fields. Björk et al. (2010)
studied a random sample of 1,837 articles and searched for them using Google. They found that
8.5% of the articles published in 2008 were available as OA through publishers’ websites. Copies of
another 11.9% of the articles were findable through Google. Chemistry had the lowest (13%) and
geological sciences had the highest (33%) rate of OA availability. OA journals were more popular in
medical sciences and biochemistry, and in many other subject fields the common method of full-text
availability was for authors to make a copy available through a website. Gargouri et al. (2012)
compared the growth of gold and green OA in 14 subject fields. They selected a random sample of
1300 articles (out of 12,500 ISI ranked journals) for the two periods of 1998-2006 and 2005-2010.
Searches for the first period were conducted in 2009 and for the second period in 2011. Their results
showed that green OA with 212.4% had a bigger growth in all subject fields except biomedicine
compared to gold OA (2.4%). Their conclusion was that the overall growth of OA was slow and about
1% per annum. A large-scale study in this area is the one by Science-Metrix (Archambault et al.,
2013). The study included a pilot study and a large-scale study. In the pilot study, 20,000 articles
published in 2008 and indexed in Scopus were selected and checked for OA availability in GS and
Google. Thirty-two per cent of them were available in OA, out of which a random sample of 500 articles
was checked in more detail. The outcome was that 48% of 2008 articles were available as OA in
2012. In the large scale study they randomly chose 320,000 articles from Scopus including 40,000
articles for each year from 2004 to 2011. The results showed that overall Gold OA has had a 24%
growth in Scopus but its annual growth was about 2%.

Khabsa and Giles (2014) used GS and Microsoft Academic Search to estimate the number of
scholarly documents available on the web. Their estimates showed that at least 114 million English-
language scholarly documents are accessible on the web, of which Google Scholar has nearly 100
million. They maintained that at least 27 million (24%) are freely available. They also found
significant variability (12-50%) in the percentage of freely available documents among different fields.

There have been several studies on the citation advantage of OA articles over non-OA articles since
early studies such as Antelman (2004) and Eysenbach (2006). An annotated list of such studies is
available (Wagner, 2010) which shows there are more than 40 studies up to 2014 that have found
evidence for the citation advantage of OA while only five studies have found no such evidence. The
overall conclusion from all past studies, which is also confirmed by the large scale study of
Archambault et al. (2013), is that OA articles do have a citation advantage. Archambault et al. (2013)
showed that OA articles were between 26% and 64% more cited on average for any given year than
all papers combined, whereas non-OA received between 17% and 33% fewer citations.

Google Scholar
A good number of studies have examined GS; a search in Scopus retrieves more than 200 articles
with ‘Google Scholar’ in their titles up to January 2015. These studies cover a range of topics,
from early general discussions about its pros and cons as an academic search engine (Noruzi, 2005;
Jacsó, 2005, 2008) to its comparison with other search engines for article retrieval (Falagas et al.,
2008), its comparison with other citation indexes as a source of scientometric data (for instance for
calculating the h-index, e.g. Meho & Yang, 2007; Bar-Ilan, 2008; Kulkarni et al., 2009; Sanni and Zainab,
2010), its potential for citation studies (Kousha and Thelwall, 2007, 2008; Aguillo, 2012), and its
coverage (Neuhaus et al., 2006; Walters, 2007).

A few studies have dealt with the details of the full-text availability of items in GS. Christianson
(2007) searched for 840 articles from core ecology journals in GS and showed that 9% of articles
were accessible in full-text off-campus while this figure was 38% on-campus. Older articles were less
likely to be represented and highly cited articles were more likely to be included in GS. Full-text
articles were concentrated at author sites and at a small number of provider sites. Norris,
Oppenheim and Rowland (2008) tested the coverage of Google and Google Scholar for finding copies
and reported that 86% of the copies could be found using either Google or GS. Archambault et al.
(2013) maintained the recall of GS for articles is imperfect.

Pitol and De Groote (2014) analyzed GS listings for 982 articles in several academic subjects from
three American universities for GS version types, including any institutional repository versions,
citation rates, and availability of free full-text. Their findings showed that OA articles were cited
more than articles that were not available in free full-text. While journal publishers’ websites were
indexed most often, only a small number of those articles were available as free full-text. They found
no correlation between the number of versions of an article and the number of times an article has
been cited. They maintained that viewing the ‘versions’ of an article might be useful when publisher
access is restricted, as over 70 percent of articles had at least one free full-text version available
through an indexed GS version.

The most relevant and similar study to the current study is the work by Martin-Martin et al. (2014).
They used 64 queries to check the availability of all 64,000 highly cited documents in GS and showed
that 40% of them could be accessed freely using GS and that most of these documents were accessible
through universities and other nonprofit organizations. They found that 18% of the items
retrieved were books and that the number of retrieved books was greater in recent years. The average
number of citations per item was greater for books (2700) than for articles (1700); 86% of full-text
items were in PDF format; nih.gov and ResearchGate were the top two full-text providers; and edu, org
and com were the top three top-level domains (TLDs) for full-text links. There was no correlation
between the documents' number of versions and their number of citations.

Method
The Scopus categorization of journals, which has three levels, was used. The first level includes four broad
categories: Social Sciences, Physical Sciences, Life Sciences, and Health Sciences. Each of these is
then broken into subcategories and each of the subcategories is broken into sub-subcategories.
The categories in the third level that were multidisciplinary or miscellaneous (such as
Psychology (Miscellaneous)) were removed. Then the authors devised three queries for each of the
remaining 277 sub-subcategories. Library of Congress subject headings, MeSH subject headings, Web
of Science subject categories, related Wikipedia articles, and articles in related journals were
consulted for devising queries related to subject categories. The appendix (the Excel file available as
Supplementary Material in the online version of the article) includes the list of queries and Scopus
subject categories. Examples of queries for the four main categories are given below and Table 1
shows the distribution of queries and subject categories:

• ‘International relations theory’ for the sub-sub-category of ‘Political Science and
International Relations’ of ‘Social sciences (all)’ of ‘Social sciences’.
• ‘Wastewater treatment’ for the sub-sub-category of ‘Environmental Engineering’ of
‘Environmental Sciences’ of ‘Physical sciences’.
• ‘Craniosacral therapy’ for the sub-sub-category of ‘Complementary and Manual Therapy’ of
‘Health professions’ of ‘Health sciences’.
• ‘Biological evolution’ for the sub-sub-category of ‘Ecology, Evolution, Behavior and
Systematics’ of ‘Agricultural and Biological Sciences’ of ‘Life sciences’.

Table 1. Number of queries and subject categories

Subject category    Number of        Number of            Number of
                    sub-categories   sub-sub-categories   queries
Life Sciences       5                41                   123
Physical Sciences   10               92                   276
Social Sciences     6                52                   156
Health Sciences     5                92                   276
Total               26               277                  831
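
The totals in Table 1 follow directly from the design (three queries per sub-sub-category, ten hits per query). The short sketch below, which uses only the counts reported in Table 1 and variable names chosen for illustration, reproduces the number of queries and the resulting number of items.

```python
# Sub-sub-category counts per broad subject, copied from Table 1.
SUB_SUBCATEGORIES = {
    "Life Sciences": 41,
    "Physical Sciences": 92,
    "Social Sciences": 52,
    "Health Sciences": 92,
}
QUERIES_PER_CATEGORY = 3  # three queries were devised per sub-sub-category
HITS_PER_QUERY = 10       # only the first results page was analysed

queries = {s: n * QUERIES_PER_CATEGORY for s, n in SUB_SUBCATEGORIES.items()}
total_queries = sum(queries.values())
total_items = total_queries * HITS_PER_QUERY

print(queries)
print(total_queries, total_items)
```

The 831 queries and 8,310 items obtained this way match the figures used in the Findings section.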

The data collection process took place in two steps. First, queries (search phrases) were created for all
individual sub-subcategories. When the researchers were not sure about the suitability of a query, an
initial quick and dirty search was done in GS using the query to make sure that a) the query was
relevant to the subject for which it was created, and b) it resulted in the retrieval of mainly journal
articles (and not books). The focus of this study is on journal articles. GS does not have an option to
exclude books from search results and if the queries were too broad, they could result in the
retrieval of many books, especially in social sciences and humanities where books are more popular.
Therefore, general or broader terms that were more likely to appear in book titles were avoided.

In the second step, each query was searched in GS, limiting the publication year to 2004-2014, and
unchecking the two options of ‘include patents’ and ‘include citations’. Most of the queries were
two-word phrases but some also included three words or just one word. All queries were searched
in quotation marks “” for a more accurate retrieval. The searches were done during April 2014 off-
campus, so that any library subscription would not affect the full-text availability. For each query,
only the first ten hits (the first page of the results) were considered for the analysis. The following
items were recorded for each of the first 10 hits:

• Times cited (Cited by X)
• Number of versions mentioned on the results page
• Number of versions with links to full-text
• Number of ‘Citation’ versions
• Actual number of versions
• URL of the full-text shown on the main results page
• Publication year
• Domain from which GS obtained the information shown on the main results page
• File format of the full-text (PDF, HTML, Doc)
• Item type (article, book, other)
• Version type of the full-text (publisher version, preprint)
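
The items listed above can be pictured as one record per hit. The following is only an illustrative sketch: the class name, field names and example values are hypothetical and are not taken from the authors' data file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HitRecord:
    """One Google Scholar hit; all field names are hypothetical."""
    query: str
    rank: int                     # position on the first results page (1-10)
    times_cited: int              # the "Cited by X" figure
    versions_reported: int        # version count shown on the results page
    versions_with_full_text: int
    citation_versions: int        # versions flagged [CITATION]
    versions_actual: int          # counted after opening "All X versions"
    full_text_url: Optional[str]  # None when no full-text link is shown
    publication_year: int
    source_domain: str
    file_format: Optional[str]    # "PDF", "HTML" or "DOC"
    item_type: str                # "article", "book" or "other"
    version_type: Optional[str]   # "publisher" or "preprint"

    @property
    def has_full_text(self) -> bool:
        return self.full_text_url is not None

    @property
    def version_discrepancy(self) -> int:
        # Difference between the counted and the reported number of versions.
        return self.versions_actual - self.versions_reported


# Made-up example values, for illustration only.
hit = HitRecord("wastewater treatment", 1, 120, 7, 2, 0, 8,
                "https://example.org/paper.pdf", 2009, "example.org",
                "PDF", "article", "preprint")
print(hit.has_full_text, hit.version_discrepancy)
```

Keeping the reported and the counted number of versions as separate fields makes it straightforward to examine the discrepancy between them later.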

To record information, for each of the first 10 results, the searchers selected the link to the full-text
if available and investigated the file. They also selected the ‘All X versions’ link shown below each
item and counted the actual number of versions shown, the number of versions with links to
full-text, and the number of versions with the ‘[CITATION]’ sign. Figure 1 shows the search options of GS.

The second author and seven library and information science master's students did the searches.
The students were trained for the job by the senior author and a written guide was also prepared for
them to consult if needed during the data collection process. To make sure all searches
were done soundly and the information was recorded correctly, a random sample (10%) of searches
was repeated by the senior author and the results were checked against the recorded information.
The outcome confirmed the soundness of the information collected. The default sort option, ‘sort by
relevance’, which sorts the results based on their relevance to queries, was used.

Figure 1. Screenshot of GS search results page

Findings
Characteristics of the items

Of the 8,310 items analyzed, 87.2% (7,244 items) were journal articles, 12.6% (1,044) were books and
the remaining 22 items (0.2%) were of other types such as reports or theses. As expected, books are
more common in social sciences, where 482 (31.2%) of the items were books. The smallest share of
books belongs to health sciences, with only 5% (Table 2). In the rest of the article, the term
‘items’ refers to all item types and the term ‘articles’ refers only to articles (excluding books and other
document types).

Table 2. Number of articles and books by broad subject category

                             Life       Physical   Social     Health
                             sciences   sciences   sciences   sciences   Total
Article   Count              1,151      2,412      1,065      2,616      7,244
          % within subject   93.7       87.5       68.8       95.0       87.4
Book      Count              78         346        482        138        1,044
          % within subject   6.3        12.5       31.2       5.0        12.6
Total     Count              1,229      2,758      1,547      2,754      8,288
          % within subject   100        100        100        100        100
χ² = 671.6, df = 3, p < 0.001
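
As an illustration of how the test statistic under Table 2 can be recomputed from the counts alone, the sketch below implements a plain Pearson chi-square for a contingency table. It assumes that the reported value is an uncorrected Pearson statistic, which the paper does not state explicitly; the function name is chosen for illustration.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Counts from Table 2; columns are life, physical, social and health sciences.
table2 = [
    [1151, 2412, 1065, 2616],  # articles
    [78, 346, 482, 138],       # books
]
stat, dof = chi_square(table2)
print(round(stat, 1), dof)
```

Under that assumption the statistic should come out close to the reported value of 671.6 with 3 degrees of freedom.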

OA and full-text availability

Full-text files of more than half of the items (57.3%, 4,765) were available and the rest (42.7%,
3,545) did not have full-text. If items are restricted to articles only, the rate of full-text availability is
61.1% (4,426). In terms of file format of full-texts, 3,295 (69.2%) were PDF, 1,446 (30.3%) were
HTML, and another 24 (0.5%) were DOC.
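
The percentages in this paragraph can be checked from the reported counts. The sketch below does so with a small helper whose name is chosen for illustration; note that the three file-format counts sum to 4,765, so the format percentages are computed here against all full-text items rather than full-text articles.

```python
def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


items_total, items_full_text = 8310, 4765
articles_total, articles_full_text = 7244, 4426
formats = {"PDF": 3295, "HTML": 1446, "DOC": 24}

print(pct(items_full_text, items_total))        # share of items with full-text
print(pct(articles_full_text, articles_total))  # share of articles with full-text
for name, count in formats.items():
    print(name, pct(count, items_full_text))
```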

Restricting items to articles only, 80.8% of full-text articles were the final publisher version (Table 3), and
another 14.4% were preprint versions. Less than one per cent (42 items) were not full-text: the
links that GS provided as full-text links led to abstract pages of articles on journal websites. There
were also a number of HTTP errors, including error 404 (page not found). In the case of errors
(e.g. 404) the researchers could not verify that the links GS provided had actually pointed to full-text
files; however, since GS provided those as full-text URLs, the researchers assumed that they had
once been full-text links.

Table 3. Versions of full-text articles

Type of version                          N       %
Publisher                                3,578   80.8
Preprint                                 637     14.4
Abstract                                 42      0.9
No access                                34      0.8
404                                      111     2.5
Other errors (403, 500, 505, 509, 550)   24      0.5
Total                                    4,426   100

Looking at differences among the four broad subject categories, the highest rate of full-text
availability for articles belongs to life sciences (66.9%). The rates of full-text availability for the other
three subject categories including physical, social and health sciences are very close. See Table 4.

Table 4. Availability of articles according to broad subject category

                                  Life       Physical   Social     Health
                                  sciences   sciences   sciences   sciences   Total
No full-text   Count              381        965        417        1,055      2,818
               % within subject   33.1       40.0       39.2       40.3       38.9
Full-text      Count              770        1,447      648        1,561      4,426
               % within subject   66.9       60.0       60.8       59.7       61.1
Total          Count              1,151      2,412      1,065      2,616      7,244
               % within subject   100        100        100        100        100
χ² = 19.8, df = 3, p < 0.001; % within subject is the fraction of items only in that subject category.

Health sciences not only accounted for the largest number of full-text articles (1,561; Table 4), but also
had the highest share of publisher versions among full-text articles (99.1%; Table 5). The smallest share of
publisher versions belongs to physical sciences (68.8%). Some fields in physical sciences (e.g. physics
and astronomy) have a rich pre-print culture and have used pre-prints for over 40 years (Brown,
2001).

Table 5. Number of full-text articles by type of version and broad subject category

                                       Life       Physical   Social     Health
                                       sciences   sciences   sciences   sciences   Total
Publisher version   Count              701        975        427        1,475      3,578
                    % within subject   96.0       68.8       73.9       99.1       84.9
Preprint version    Count              29         443        151        14         637
                    % within subject   4.0        31.2       26.1       0.9        15.1
Total               Count              730        1,418      578        1,489      4,215
                    % within subject   100        100        100        100        100
χ² = 645.9, df = 3, p < 0.001

Sources of information

The top 13 websites (with more than 50 items) from which GS has extracted the bibliographic
information of the items not available in full-text are presented in Table 6. These 13 websites are the
sources for the information of 75% (2,666 out of 3,545) of non-full-text items. Most of them are
international STM publishers, with Elsevier being number one in the list. This is not surprising as
Elsevier is the world's largest STM publisher and it published 350,000 research articles in more than
2,000 journals in 2013 (Publishers Weekly, 2014). It is interesting to see the China National
Knowledge Infrastructure (CNKI) among the top ten, since the retrieved articles in this study were all in
English and China is a non-English-speaking country. China has been second in scientific
publications (i.e. number of international articles) after the USA since 2006 (Zhou & Leydesdorff, 2008).
China publishes 189 English journals and CNKI probably provides English bibliographic information
for articles in Chinese, as many non-English journals published in non-English-speaking countries also
have English tables of contents and abstracts. GS results also include many items that are retrieved
from Google Books. Of the 1,044 books retrieved as part of the 8,310 items, 582 (55.7% of books)
were from Google Books. Regardless of the sources of books, 317 (30.4% of books) were available in
full-text.

Table 6. Frequency of sources of information for non-full-text items

Source of information                              N       %
Elsevier                                           823     23.2
Google books                                       582     16.4
Wiley                                              346     9.8
Springer                                           189     5.3
Europe PMC                                         120     3.4
AIP Scitation                                      113     3.2
LWW                                                108     3.0
China National Knowledge Infrastructure (CNKI)     76      2.1
Taylor & Francis                                   71      2.0
Sage                                               69      1.9
ACS Publication                                    64      1.8
IEEE                                               58      1.6
Nature                                             47      1.3
Total                                              2,666   75

Restricting the data to full-text articles only, the individual websites that provide full-text files
are presented in Table 7. ResearchGate.net accounts for 10.5% (466) of all full-text articles. This is
interesting because ResearchGate is an academic social networking service or a reputation platform
for academics and not a repository. ResearchGate does not allow authors to link their publications to
full-text files hosted on other websites. Users can upload their full-texts onto the site itself, or
in some other cases ResearchGate finds the full-text on third-party sites (e.g. arXiv.org). This has turned
ResearchGate into a growing source of full-text articles. The Wiley Online Library and ScienceDirect are
in third and fifth place respectively in Table 7. This might be because of the full OA journals or
hybrid OA journals these publishers publish. The presence of CiteSeerX among the sources of full-text
files is interesting because it is an academic search engine similar to Google Scholar, which also hosts
some files. The nine websites presented in Table 7 account for 32.2% of all full-text articles (1,428
out of 4,426 full-text articles).

Table 7. Frequency of sources of full-text articles

Source of full-text       N       %
researchgate.net          466     10.5
ncbi.nlm.nih.gov          286     6.5
onlinelibrary.wiley.com   176     4.0
arxiv.org                 132     3.0
sciencedirect.com         120     2.7
citeseerx.ist.psu.edu     68      1.5
biomedcentral.com         62      1.4
nature.com                59      1.3
pnas.org                  59      1.3
Total                     1,428   32.2

Analysis of the top-level domain (TLD) names of the URLs of full-text items reveals that
academic/educational websites account for the largest number of full-text sources (30.3%). This
category includes all sites with .edu, ac.xx, or other academic sites that do not follow these two
conventions (e.g. websites of German universities). The second most common domain is organizational
(.org) websites such as arxiv.org. Commercial websites such as publishers rank third and
account for 21.2% of full-text items. There were also 41 unknown items.

The distribution of the TLDs of full-text items (Table 8) shows that some TLDs are more common in
some subject categories. The most common TLD in life sciences (28.9%) and health sciences (28.5%)
is ‘.org’, while the most common TLD in physical sciences (43.2%) and social sciences (43.9%) is
educational domain names. About 80 full-text links used IP addresses (e.g. 163.178.XXX.X) instead of
domain names, and the authors used reverse DNS lookup to find their domain names for this analysis.
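
The grouping of hosts into the categories used in this analysis can be sketched as follows. This is not the procedure the authors ran (the paper does not describe their tooling); the function names, the regular expression and the example hosts are assumptions made for illustration. Academic sites that follow neither the .edu nor the ac.xx convention (such as German universities) would still need manual classification, as in the study.

```python
import ipaddress
import re
import socket
from urllib.parse import urlparse

GENERIC_TLDS = {"com", "org", "gov", "net"}
# Academic hosts such as example.edu, example.edu.au or example.ac.uk.
ACADEMIC_HOST = re.compile(r"\.(edu(\.[a-z]{2})?|ac\.[a-z]{2})$")


def classify_host(host):
    """Map a host name to one of: edu, com, org, gov, net, other."""
    host = host.lower().rstrip(".")
    if ACADEMIC_HOST.search(host):
        return "edu"
    tld = host.rsplit(".", 1)[-1]
    return tld if tld in GENERIC_TLDS else "other"


def host_of(url, resolve_ips=False):
    """Return the host of a URL, optionally resolving a bare IP address."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return host  # already a domain name
    if resolve_ips:
        try:
            return socket.gethostbyaddr(host)[0]  # reverse DNS lookup
        except OSError:
            pass  # no reverse record; keep the IP address
    return host


print(classify_host(host_of("https://example.ac.uk/files/paper.pdf")))
```

Reverse resolution is switched off by default so that the classification can be run offline; hosts that remain IP addresses fall into the ‘other’ group.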

Table 8. Frequency of top level domains (TLDs) of full-text items by subject category

                                        edu     com     org     gov     net     Other   Total
Life sciences       Count               174     164     227     69      118     34      786
                    % within subject    22.1    20.9    28.9    8.8     15.0    4.3     100
Physical sciences   Count               679     232     309     51      190     111     1,572
                    % within subject    43.2    14.8    19.7    3.2     12.1    7.1     100
Social sciences     Count               352     162     116     45      71      56      802
                    % within subject    43.9    20.2    14.5    5.6     8.9     7.0     100
Health sciences     Count               241     453     458     189     183     81      1,605
                    % within subject    15.0    28.2    28.5    11.8    11.4    5.0     100
Total               Count               1,446   1,011   1,110   354     562     282     4,765
                    %                   30.3    21.2    23.3    7.4     11.8    5.9     100
χ² = 522, df = 18, p < 0.001

Correlation of OA, citations and versions

Figure 2 shows the distribution of articles retrieved by year and by availability of full-text. Quite
naturally the closer to the recent time (2014) the fewer articles have been retrieved and the number
of articles from 2014 is very low as the data collection was done in early 2014. Figure 3 shows the
median of times cited for full-text and non-full-text articles by year. Median is used because there
were outliers and the distribution of citations was not normal. It is clear that older articles have
received more citations due to a longer citation window. The only exception is 2014, in which
the number of citations exceptionally grows, but that should be interpreted with caution as the
number of items for 2014 was very small.

Figure 2. Number of full-text and non-full-text articles by year

Figure 3. Median of times cited for full-text and non-full-text articles by year

There was a weak positive correlation between number of versions and times cited for full-text
articles (r = 0.346, p < 0.001, n = 4,426); the same correlation existed for non-full-text articles but
was much weaker (r = 0.109, p < 0.001, n = 2,818). There was also a very weak positive correlation
between number of full-text versions and times cited for full-text articles (r = 0.166, p < 0.001, n =
4,426). A t-test showed a significant difference in the number of times cited between full-text
items (M = 286, SD = 535) and non-full-text items (M = 161, SD = 361), t(7450) = -11.9, p < 0.001.
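
For readers who wish to see how a t statistic of this size arises from the reported summary statistics, the sketch below computes an unequal-variance (Welch) t statistic. Two assumptions are made that the paper does not confirm: that the test did not assume equal variances, and that the group sizes are the article counts given above (4,426 full-text and 2,818 non-full-text).

```python
from math import sqrt


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic for two independent groups, from summary statistics."""
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# Means and SDs as reported; the group sizes are ASSUMED to be the article counts.
t = welch_t(161, 361, 2818, 286, 535, 4426)
print(round(t, 1))
```

Under these assumptions the statistic is close to the reported value of -11.9; the sign is negative because the non-full-text mean is subtracted first.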

Subject differences also exist in the number of times each item is cited (Table 9, Figure 4). The
average number of citations is lowest in life sciences and highest in physical sciences. The comparison
of these data with the average citations by field based on the Web of Science (Essential Science
Indicators, ESI) reveals some differences. In ESI, life sciences, health sciences, physical sciences, and
social sciences have, in that order, the largest to lowest citation averages. The reason might be that
these data only include highly cited documents (first 10 hits) and also that GS captures citations that WoS
does not.

Table 9. Number of times cited (all items) by broad subject category

                    Mean    5% trimmed mean   SD      Median
Life sciences       238.2   178.5             457.3   112.5
Physical sciences   382.2   313.1             202.9   112.0
Social sciences     338.1   217.9             795.7   114.0
Health sciences     282.2   183.6             681.1   100
Kruskal-Wallis test: χ² = 4.93, df = 3, p = 0.177
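
The 5% trimmed mean in Table 9 is reported alongside the mean because citation counts are heavily skewed. As a rough illustration of the idea, the sketch below drops the lowest and highest 5% of observations before averaging. It is a simplified version (statistical packages may weight the boundary observations instead of dropping whole observations), and the data are hypothetical.

```python
def trimmed_mean(values, proportion=0.05):
    """Mean after dropping `proportion` of the observations from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # observations cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical citation counts with one extreme outlier.
cites = list(range(1, 20)) + [1000]
print(sum(cites) / len(cites), trimmed_mean(cites))
```

In this made-up sample a single outlier pulls the ordinary mean far above the trimmed mean, which is the pattern visible in every row of Table 9.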

Figure 4. Logarithmic distribution of number of times cited (all items) by broad subject category

There is a statistically significant difference among subject categories in the number of versions
available for each item (Table 10). Social sciences have
the fewest versions available, while items in health and life sciences have a larger number
of versions available on average. It should be noted that the number of versions shown under each
item on the main GS results page is an estimate and sometimes differs slightly from the actual
number of versions one can see after clicking on the ‘All X versions’ option. A discrepancy between
the number of versions mentioned by GS and the count of the actual number of versions existed for
1,410 items. For the remaining 6,900 items there was no difference. The difference, however, is
small overall, with a mean difference of 0.2 (SD = 0.6) and a median of zero.

In terms of the number of citation versions (those marked as [CITATION]) there were significant
differences among subjects (Kruskal-Wallis χ² = 75, df = 3, p < 0.001) in that life sciences had the
largest number of citation versions (M = 1.15) and health sciences had the lowest (M = 0.93). The
mean numbers of citation versions for physical sciences and social sciences were 1.10 and 0.96 respectively.

Table 10. Number of versions (all items) by broad subject category

                    Mean   5% trimmed mean   SD     Median
Life sciences       8.87   8.21              6.33   7
Physical sciences   8.69   7.80              7.73   7
Social sciences     7.72   6.48              9.90   5
Health sciences     9.18   8.16              8.60   7
Kruskal-Wallis test: χ² = 195.6, df = 3, p < 0.001

Discussion
This study used queries to retrieve articles from Google Scholar and examined the availability of full-
text, versions and times cited in four broad subject categories. The method used in the study has
some limitations that should be taken into account while interpreting the results.

Although searches were restricted to items published between 2004 and 2014, the date range filter does
not seem to work accurately in GS; the results would be more accurate if one searched year by year.
This, however, would take a lot more time. All the queries were performed in English, which is
a limitation for articles written in other languages (some quite common in social sciences and
humanities); English queries might therefore have introduced some bias.

Attempts were made to construct queries so that fewer books would be retrieved, because the focus
of the study was on journal articles. However, 12.6% of the search results were still books. This
figure is lower than the 18% reported by Martin-Martin et al. (2014), possibly because of our
intentional bias toward journal articles. The impact of this decision is probably greater in areas
such as social sciences, where books are more popular. As a reviewer suggested, a better way to
avoid books in the results is to retrieve results beyond number 10 on the search engine results
page. For example, if two books are found within the top ten, the researcher continues to the 12th
result and ignores the two books.
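
The reviewer's suggestion amounts to walking down the ranked list until ten non-book items have been collected. A minimal sketch follows, assuming each result is a dictionary with a hypothetical `is_book` flag that has already been assigned by the researcher; nothing here queries GS itself.

```python
def first_n_articles(results, n=10):
    """Keep the first n non-book items from a ranked result list."""
    kept = []
    for item in results:
        if item.get("is_book"):
            continue  # skip books and keep looking further down the list
        kept.append(item)
        if len(kept) == n:
            break
    return kept


# Hypothetical ranked results: positions 3 and 7 are books.
results = [{"rank": r, "is_book": r in (3, 7)} for r in range(1, 21)]
sample = first_n_articles(results)
print([item["rank"] for item in sample])
```

With two books among the first ten hits, the sample extends to the 12th result, as in the example given in the text.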

GS claims that, when ranking a document, it takes into account how often and how recently it has
been cited in other scholarly literature1. In this study only the first 10 hits for each query were
included in the analysis, and this clearly biased the data towards more highly cited documents. All
of the findings might be different if the study were repeated with a focus on non-cited documents.

1 https://scholar.google.com/intl/en/scholar/about.html

The volume of journal articles is not equal across the four broad subject areas. The sample size for
each subject in this study was based on the number of sub-fields each subject has in the Scopus
subject categorization. Therefore, the researchers cannot maintain that the size of the sample for
each subject reflects the real volume of the literature (journal articles) in that subject. However,
Scopus subject categories reflect the specialization of the subject areas, and it is logical to use
them as a basis for sampling.

This study found that 61.1% of articles were available in full-text through GS. This is higher than
the 40% found by Martin-Martin et al. (2014) and lower than the 70% found by Pitol and De Groote
(2014). However, the settings differed across these studies; for example, the time frame was
limited to 2004-2014 in the current study. Other studies have found different figures, such as 48%
in Archambault et al. (2013) and 24% in Khabsa and Giles (2014). The PDF file format accounted for
69.2% of full-text items, compared with 80% in Martin-Martin et al. (2014). Four-fifths of full-text
articles were publisher versions.

Similar to Martin-Martin et al. (2014), educational domains and nonprofit organizations were the
main sources of full-text items presented in GS. In both studies, ResearchGate, an academic social
network, turned out to be a main source of full-text items (number one in this study and number two
in Martin-Martin et al.). Its share is expected to increase as the number of ResearchGate users
grows; the site claims that its users upload two million publications every two months2.
Clearly, academics use their personal homepages, institutional repositories or social networks to
make their output available in full-text to their peers. Future longitudinal studies need to
investigate whether the availability of full-text increases over time and whether the role of social
networks such as ResearchGate as a source of full-text articles becomes more important. One should
also note that full-texts uploaded to ResearchGate are sometimes not indexed by Google Scholar for a
long time, and that even when they are indexed, the ResearchGate version may not be the primary
version in GS. This means that the presence of ResearchGate may be even greater than reported here.
The sample in this study is focused on the primary versions of records sorted by relevance
(generally the most cited).

Non-full-text items are mainly from STM publishers such as Elsevier, Springer and Wiley. These
publishers also account for a number of full-text articles, as they publish some fully OA journals
and hybrid OA journals.

Differences exist among subject categories in terms of the rate of OA, number of versions, times
cited and sources of full-text articles. Among subject categories, life sciences have the highest
rate of full-text items. More full-text items in health sciences and life sciences were publisher
versions than in physical sciences and social sciences. These findings are in line with those of
Björk et al. (2010), who found that OA journals were more popular in medical sciences.

2 https://explore.researchgate.net/display/news/2014/08/13/Celebrating+five+million+members+with+free+DOIs

There was a statistically significant difference between the mean times cited of full-text items
and non-full-text items: full-text items received more citations on average. There were also
positive, but very weak, correlations between the number of times an item has been cited and both
its number of versions and its number of full-text versions. Pitol and De Groote (2014) and
Martin-Martin et al. (2014) found no such correlations in their studies.

Conclusions
Returning to the objectives of the study, the findings showed that 61.1% of the articles were freely
available, of which 80.8% were publisher versions and another 14.4% were Green OA (preprints). Life
sciences had the highest rate of full-text availability (66.9%). Publishers’ websites (including
those of Elsevier, Wiley and Springer) were a main source of bibliographic information for
non-full-text articles. For full-text items, educational domains accounted for almost a third of the
items, and org and com domains were the source of 23.3 and 21.2 per cent of items respectively.
Looking at single platforms, ResearchGate was the top single source of full-text articles, providing
10.5% of all full-text articles, and nih.gov accounted for another 6.5%. Some publishers’ websites
(such as Wiley Online Library and ScienceDirect) were also among the top sources of full-text items.
There was a weak positive correlation between number of versions and times cited for full-text
articles. There was also a positive, but very weak, correlation between number of full-text versions
and times cited for full-text articles.
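
Source shares by domain type, such as those reported above, can be derived from the URLs of the full-text links. The sketch below is only an illustration under simplifying assumptions: it classifies each URL by the last label of its host name, so country-level academic domains (for example those ending in .ac.uk) would not be counted as educational, and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def top_level_domain(url):
    """Return the last label of the host name, e.g. 'edu' or 'org'."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def domain_shares(urls):
    """Percentage of URLs per top-level domain."""
    counts = Counter(top_level_domain(u) for u in urls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tld: 100 * c / total for tld, c in counts.items()}


# Hypothetical full-text links (illustrative only).
urls = [
    "https://www.researchgate.net/publication/1",
    "http://www.ncbi.nlm.nih.gov/pmc/articles/1",
    "http://repository.example.edu/a.pdf",
    "http://example.edu/b.pdf",
]
shares = domain_shares(urls)
print(shares)
```

Counting whole host names instead of their last label would give the per-platform shares (for example for ResearchGate or nih.gov) in the same way.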

Overall, the study contributed to the methodology for investigating GS and shed some light on GS as
a black box. The study showed the sources of items (both full-text and non-full-text) in GS and the
differences in number of versions and times cited among disciplines. The rate of full-text
availability is high, at least for highly cited articles, and there are subject differences in the
rate of OA.

Acknowledgement
The study was funded by Kharazmi University and the senior author would like to thank the LIS
department of Tabriz University for hosting him during his sabbatical while working on this study.

References
Aguillo, I. F. (2012). Is Google Scholar useful for bibliometrics? A webometric
analysis. Scientometrics, 91(2), 343-351. doi: 10.1007/s11192-011-0582-8.
Antelman, K. (2004). Do open-access articles have a greater research impact? College & Research
Libraries, 65 (5), 372-382. doi: 10.5860/crl.65.5.372
Archambault, E., Amyot, D., Deschamps, P., Nicol, A., Rebout, L., & Roberge, G. (2013). Proportion of
open access peer-reviewed papers at the European and world levels—2004-2011. Report. Science-Metrix
Inc. http://www.science-metrix.com/pdf/SM_EC_OA_Availability_2004-2011.pdf. Accessed 10 October 2013.
Bar-Ilan, J. (2008). Which h-index?-a comparison of WoS, Scopus and Google
Scholar. Scientometrics, 74 (2), 257-271. doi: 10.1007/s11192-008-0216-y

Björk, B. C., Welling, P., Laakso, M., Majlender, P., Hedlund, T., & Gudnason, G. (2010). Open Access
to the Scientific Journal Literature: Situation 2009. PLoS ONE, 5 (6): e11273.
doi: 10.1371/journal.pone.0011273
Brown, C. M. (2001). The E-volution of Preprints in the Scholarly Communication of Physicists and
Astronomers. Journal of the American Society for Information Science and Technology, 52 (3),
187–200.
Charles, L., & Booth, H. A. (2011). An Overview of Open Access in the Fields of Business and
Management. Journal of Business and Finance Librarianship, 16(2), 108-124. doi:
10.1080/08963568.2011.554786
Christianson, M. (2007). Ecology articles in Google Scholar: levels of access to articles in core
journals. Issues in Science and Technology Librarianship, doi: 10.5062/F4MS3QPD
Cothran, T. (2011). Google Scholar acceptance and use among graduate students: A quantitative
study. Library and Information Science Research, 33 (4), 293-301. doi:
10.1016/j.lisr.2011.02.001
de Winter, J. C., Zadpoor, A. A., & Dodou, D. (2014). The expansion of Google Scholar versus Web of
Science: a longitudinal study. Scientometrics, 98 (2), 1547-1565. doi: 10.1007/s11192-013-
1089-2
Eysenbach, G. (2006). Citation advantage of open access articles. PLoS biology, 4 (5): e157. doi:
10.1371/journal.pbio.0040157
Falagas, M. E., Pitsouni, E. I., Malietzis, G. A., & Pappas, G. (2008). Comparison of PubMed, Scopus,
web of science, and Google scholar: strengths and weaknesses. The FASEB Journal, 22 (2), 338-
342.
Gargouri, Y., Larivière, V., Gingras, Y., & Harnad, S. (2012). Green and Gold Open Access Percentages
and Growth, by discipline. In E. Archambault, Y. Gingras, & V. Larivière (Eds.), Proceedings of
17th International Conference on Science and Technology Indicators, Montréal: Science-
Metrix and OST. http://sticonference.org/Proceedings/vol1/Gargouri_Green_285.pdf.
Accessed 10 October 2013.
Hartman, K., & Mullen, L. (2008). Google Scholar and academic libraries: An update. New Library
World, 109 (5/6), 211-222. doi: 10.1108/03074800810873560
Harzing, A. W. (2014). A longitudinal study of Google Scholar coverage between 2012 and
2013. Scientometrics, 98 (1), 565-575. doi: 10.1007/s11192-013-0975-y
Haya, G., Nygren, E., & Widmark, W. (2007). Metalib and Google Scholar: A user study. Online
Information Review, 31 (3), 365-375. doi: 10.1108/14684520710764122
Howland, J. L., Wright, T. C., Boughan, R. A., & Roberts, B. C. (2009). How scholarly is Google Scholar?
A comparison to library databases. College and Research Libraries, 70 (3), 227-234. doi:
10.5860/crl.70.3.227
Jacsó, P. (2005). Google Scholar: the pros and the cons. Online information review, 29(2), 208-21.
doi: 10.1108/14684520510598066
Jacsó, P. (2008). Google scholar revisited. Online Information Review, 32 (1), 102-114. doi:
10.1108/14684520810866010
Khabsa M, Giles CL (2014) The Number of Scholarly Documents on the Public Web. PLoS ONE 9(5):
e93949. doi:10.1371/journal.pone.0093949

Kousha, K., & Thelwall, M. (2007). Google Scholar citations and Google Web/URL citations: A
multi‐discipline exploratory analysis. Journal of the American Society for Information Science
and Technology, 58 (7), 1055-1065. doi: 10.1002/asi.20584
Kousha, K., & Thelwall, M. (2008). Sources of Google Scholar citations outside the Science Citation
Index: A comparison between four science disciplines. Scientometrics, 74 (2), 273-294. doi:
10.1007/s11192-008-0217-x
Kulkarni, A. V., Aziz, B., Shams, I., & Busse, J. W. (2009). Comparisons of citations in Web of Science,
Scopus, and Google Scholar for articles published in general medical journals. JAMA, 302 (10),
1092-1096. doi: 10.1001/jama.2009.1307
Lercher, A. (2008). A survey of attitudes about digital repositories among faculty at Louisiana State
University at Baton Rouge. The Journal of Academic Librarianship, 34 (5), 408-415.
doi:10.1016/j.acalib.2008.06.008
Martín-Martín, A., Orduña-Malea, E., Ayllón, J. M., & Delgado López-Cózar, E. (2014). Does Google
Scholar contain all highly cited documents (1950-2013)? EC3 Working Papers, 19.
http://arxiv.org/abs/1410.8464. Accessed 25 March 2015.
Meho, L. I., & Yang, K. (2007). Impact of data sources on citation counts and rankings of LIS faculty:
Web of Science versus Scopus and Google Scholar. Journal of the American Society for
Information Science and Technology, 58 (13), 2105-2125. doi: 10.1002/asi.20677
Miguel, S., Bongiovani, P. C., Gómez, N. D., & Bueno-de-la-Fuente, G. (2013). Prospect for
Development of Open Access in Argentina. Journal of Academic Librarianship, 39(1), 1-2. doi:
10.1016/j.acalib.2012.10.002
Neuhaus, C., Neuhaus, E., Asher, A., & Wrede, C. (2006). The depth and breadth of Google Scholar:
An empirical study. Portal: Libraries and the Academy, 6 (2), 127-141. doi:
10.1353/pla.2006.0026
Norris, M., Oppenheim, C., Rowland, F. (2008). The Citation Advantage of Open Access Articles.
Journal of the American Society for Information Science and Technology 59 (12), 1963–1972.
Noruzi, A. (2005). Google Scholar: The new generation of citation indexes. Libri, 55 (4), 170-180. doi:
10.1515/LIBR.2005.170
Ollé, C., & Borrego, A. (2010). A qualitative study of the impact of electronic journals on
scholarly information behavior. Library & Information Science Research, 32 (3), 221-228. doi:
10.1016/j.lisr.2010.02.002
Orduña-Malea, E., Ayllón, J. M., Martín-Martín, A., & López-Cózar, E. D. (2014). About the size of
Google Scholar: playing the numbers. arXiv preprint.
https://dspace.muni.cz/jspui/bitstream/ics_muni_cz/1022/1/ZatrochovaEsej.pdf. Accessed 10
October 2013.
Pitol, S. P., & De Groote, S. L. (2014). Google Scholar versions: do more versions of an article mean
greater impact? Library Hi Tech, 32 (4), 594-611. doi: 10.1108/LHT-05-2014-0039
Sanni, S. A., & Zainab, A. N. (2010). Google Scholar as a source for citation and impact analysis for a
non-ISI indexed medical journal. Malaysian Journal of Library & Information Science, 15(3), 35-
51.
Wagner, A. B. (2010). Open Access Citation Advantage: An Annotated Bibliography. Issues in Science
and Technology Librarianship, doi: 10.5062/F4Q81B0W.
Walters, W. H. (2007). Google Scholar coverage of a multidisciplinary field. Information Processing &
Management, 43 (4), 1121-1132. doi: 10.1016/j.ipm.2006.08.006
Publishers Weekly (2014, 27 June). Global Publishing Leaders 2014: Reed Elsevier.
http://www.publishersweekly.com/pw/by-topic/industry-news/publisher-
news/article/63099-global-publishing-leaders-2014-reed-elsevier.html Accessed 03 April
2015.
Zhou, P. & Leydesdorff, L. (2008). China Ranks Second in Scientific Publications since 2006, ISSI
Newsletter, Nr. 13, 7-9.
