www.emeraldinsight.com/0264-0473.htm
EL 37,2
Study of Asian RDR based on re3data
Jane Cho
Department of Library and Information Science, Institute of Social Science,
University of Incheon, Incheon, Republic of Korea
Received 28 January 2019
Revised 15 March 2019
23 March 2019
Accepted 26 March 2019
Abstract
Purpose – Research data repositories (RDRs) have become an essential academic infrastructure in an
atmosphere that facilitates the openness of research output supported by public research funds. This study
aims to understand the operational status of 152 Asian data repositories on re3data and to cluster the
repositories into four groups according to that status. In addition, it identifies the main subject areas of
RDRs in Asian countries and examines what topic correlations exist between data archived in Asian countries.
Design/methodology/approach – This study extracts metadata from re3data and analyzes it in various
ways to grasp the current status of research data repositories in Asian countries. The author clusters the
repositories into four groups using hierarchical cluster analysis according to the level of operation. In
addition, to identify the main subject areas of RDRs in Asian countries, the keywords of the subject field
assigned to each repository are extracted, and a Pathfinder Network (PFNET) analysis is performed.
Findings – About 70 per cent of the Asian-country repositories declare licenses or policies but have
neither permanent identifiers nor international-level certification. As a result of the subject
domain analysis, eight clusters are formed centering on life sciences and natural sciences.
Originality/value – The research output in developing countries, especially non-English-speaking
countries, tends not to be smoothly circulated in the international community due to the immaturity of the
open-access culture, as well as linguistic and technical problems. This study has value, in that it investigates
the status of Asian countries’ research data management and global distribution infrastructure in global
open-science trends.
Keywords Open access, Asia, Open science, Research data repositories, RDR
Paper type Research paper
Introduction
Research data are a by-product of research in a variety of media, including statistical
records, sound sources, media and images, as well as factual data based on observation,
experimentation and experience (DataCite, 2012). Correctly generated and curated data
sets can be re-analyzed to verify the results of the research, to improve the transparency
and reliability of the research process or to answer other research questions. The
published research data can be used as a catalyst for science and technology
development, enhancing mutual trust between society and science as an asset of
humanity as a whole.
Trustworthy research data repositories (RDRs) are needed to ensure that research data
are stored and published reliably. An RDR is a sustainable information infrastructure that
provides long-term storage and research data access. It can be said that it is an essential
element of the research infrastructure used by the scientific community to carry out the
highest level of research in each field. Recently, it has become essential in an atmosphere
that facilitates the openness of research data granted by public research funds (OECD, 2013).
The National Science Foundation (NSF), Research Councils UK (RCUK) and other research
support agencies are strengthening policies to deposit research results output granted from
public funds in reliable repositories (STEPI, 2017). Therefore, the criteria and the
certification system for securing the reliability of RDRs are becoming important issues.
The Electronic Library, Vol. 37 No. 2, 2019, pp. 302-313
© Emerald Publishing Limited, 0264-0473
DOI 10.1108/EL-01-2019-0016
According to the RDR directory re3data (re3data.org), 2,013 data repositories are
registered worldwide. However, there are only 210 repositories that have been certified by
the World Data System (WDS) and the Data Seal of Approval (DSA). In Asia, only nine in
China and eight in Japan were certified as trusted repositories. Asian countries still do not
have robust open-access policies as compared to developed countries in the Western
hemisphere, and the technical infrastructure for them is weak. However, the problem in
which Asian research results are being missed globally can hinder balanced development
through the smooth flow of knowledge in international society.
Therefore, this study attempts to understand the operational status of 152 Asian data
repositories on re3data and clusters repositories into four groups according to their
operational status. In addition, this study identifies the main subject areas of RDRs in Asian
countries and tries to understand what topic correlations exist between research data
archived in Asian countries. By doing so, this study seeks to understand the global
distribution and recycling potential of Asian-originated research findings in global open-
science trends.
Background
Key requirements for reliable data storage
Research funding agencies are requiring researchers to disclose and manage research data.
The results of a researcher who received a public grant must be stored in a reliable data
repository. Research data must be properly managed, curated and archived so that research
results can be verified and reused. In 2017, the World Data System of the International
Council for Science (ICSU WDS) and DSA established CoreTrustSeal (www.coretrustseal.org),
a certification body for data repositories. CoreTrustSeal is a non-profit organization
dedicated to improving the sustainability and reliability of data repository infrastructures
and operates a certification system based on a unified set of requirements to certify reliable
data repositories.
DSA and WDS each have their own certification standards, for data storage for long-term
preservation and for delivery of scientific data, respectively. However, they shared the view
that simple certification criteria that are easy to adjust and modify were needed, so they
established CoreTrustSeal.
The evaluation index for certification can be summarized as follows: organizational
infrastructure (mission/scope, licenses, continuity of access, confidentiality/ethics,
organizational infrastructure and expert guidance), digital object management (data
integrity and authenticity, appraisal, documented storage procedures, preservation plan,
data quality, workflows, data discovery and identification and data reuse) and technology
(technical infrastructure and security).
re3data
re3data.org is a registry of repositories in 67 countries around the world. It is the
result of a research project funded by the German Research Foundation (DFG) from 2012 to
2015. Since January 2016, it has been operated as a service of DataCite. Re3data uses a
comprehensive metadata set of 42 attributes to index and describe RDRs.
Re3data provides the following information: first, general information about the RDR,
such as the repository name, URL and scope; second, information about the RDR authority,
such as the name, type, location and type of responsibility; third, legal aspects, such as
access and data upload policies; and fourth, technical aspects, such as persistent identifier
systems and application programming interfaces. All re3data records can be persistently
accessed and cited via a digital object identifier (DOI), and re3data provides an application
programming interface (API).
The following is a brief summary of the global RDR status according to re3data. First, the field
of RDR is dominated by the life and natural sciences. Second, there are 947 RDRs in the USA,
280 in the UK, 182 in the EU and 142 in Canada, so the USA and Europe account for a large
portion. In Asia, there are a total of 152 repositories, including 37 in China, 57 in Japan, 29 in
India and six in Korea. Third, there are 1,710 subject repositories and 503 institutional
repositories, so subject repositories take up a larger share. Fourth, there are many types
of content, including standard office documents and images, in addition to scientific and
statistical data formats. Fifth, the total number of certified repositories is 210,
and the types of certification are classified into WDS, DSA and CoreTrustSeal. Sixth, RDR
software varies, but the most commonly used is DSpace, and there are other cases using
DataVerse, developed by Harvard University. Seventh, the most common metadata standard
is Dublin Core, and there are cases that use the Data Documentation Initiative (DDI) and
DataCite metadata schema. As for persistent identifiers (PIDs), most institutions do not
use one, but DOI is the most frequent among those that have adopted PIDs. Eighth, most
data licenses are declared as copyright, but many are disclosed as Creative Commons (CC)
or public domain.
Precedent research
There are not many studies about the re3data registry, but they can be summarized as follows.
Pampel et al. (2013) explained how to identify the appropriate repository for research data
archiving via re3data.org. They noted that re3data can effectively identify the strengths and
weaknesses of the repository infrastructure by providing an icon system that utilizes
indexes about persistent identifiers, open access, policy, certification and so forth. Kindling
et al. (2017) analyzed 1,381 RDRs registered at re3data.org by 2015. They analyzed the
repository operating agency, access conditions and service levels and noted that the level of
service provided by the RDRs varies across subject areas.
Research on data repositories has also been conducted using re3data at the national
level. A study of 247 UK data repositories registered in re3data (Zhang et al., 2017)
analyzed the subject areas of the data repositories and the functions of the platforms.
In addition, a study analyzed the status of data repositories by collecting re3data
records for Korea, China and Japan (Kim, 2018). That study analyzed the differences
between the three East Asian countries with regard to organizational type, quality
control and subject areas of the repositories. It found that most of them are similar
in terms of version and quality management. However, data repositories in Korea are
most active in earth science and those in Japan in physics, while those in China are
more active in the life sciences.
Another study proposed a metadata schema by collecting and analyzing repository technical
information from re3data.org (Kim and Choi, 2017). In that study, the necessary data were
collected from re3data.org with a purpose-built crawler program. The authors noted problems
with missing essential elements and values, as well as an insufficient controlled vocabulary.
Research method
This study extracts metadata for the RDRs of Asian countries on re3data and analyzes them
by various methods. The extraction target is 152 research data repositories of Asian
countries, including China (37), Hong Kong (one), India (31), Indonesia (two), Japan (57),
South Korea (six), North Korea (one), Pakistan (one), Singapore (four), Taiwan (nine), the
Philippines (one), Thailand (one) and Turkey (one). Specifically, this study extracts and
analyzes attributes of RDR, such as content type, subject, keyword, data license, PID,
repository system, metadata and API. Even though re3data gives plural attributes in
metadata, this study used the first representative attributes.
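The "first representative attribute" rule can be illustrated with a short sketch. The record below is a made-up, simplified XML fragment: its element names (repository, contentType, subject, pidSystem, dataLicense) and the "Unanswered" default are placeholders chosen for illustration and are not claimed to match the actual re3data schema or API output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified record. Element names are illustrative only and
# are not taken from the real re3data metadata schema.
SAMPLE_RECORD = """
<repository>
  <name>Example Repository</name>
  <contentType>Scientific and statistical data formats</contentType>
  <contentType>Images</contentType>
  <subject>Life Sciences</subject>
  <subject>Medicine</subject>
  <pidSystem>DOI</pidSystem>
</repository>
"""


def first_attribute(record, tag, default="Unanswered"):
    """Return the text of the first <tag> child, or a default if it is absent or empty."""
    element = record.find(tag)
    if element is None or not (element.text or "").strip():
        return default
    return element.text.strip()


def representative_attributes(xml_text, tags):
    """Keep only the first value of each (possibly multi-valued) attribute."""
    record = ET.fromstring(xml_text.strip())
    return {tag: first_attribute(record, tag) for tag in tags}


attrs = representative_attributes(
    SAMPLE_RECORD, ["contentType", "subject", "pidSystem", "dataLicense"]
)
print(attrs)
```

Taking only the first value keeps one row per repository, at the cost of undercounting secondary content types and subjects.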
In addition, the re3data icon system is used to extract data about whether the repository
is open access, declares licenses, provides a permanent identifier system, is
certified and discloses policies. The data were converted into binary code and
used to identify the operational level of the repository.
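The binary coding, and the kind of hierarchical clustering the study goes on to apply to it, can be sketched as follows. This is a minimal pure-Python illustration, not the SPSS procedure used in the study: the indicator names, the toy repositories, the Hamming distance and the average-linkage rule are all assumptions, since the paper does not state which distance measure or linkage method was chosen.

```python
from itertools import combinations

# Assumed indicator order; the names are illustrative labels for the five icons.
INDICATORS = ("open_access", "license", "pid", "certificate", "policy")


def to_binary(icons):
    """Map the set of icons shown for a repository to a 0/1 tuple."""
    return tuple(1 if name in icons else 0 for name in INDICATORS)


def hamming(a, b):
    """Number of indicators on which two repositories differ."""
    return sum(x != y for x, y in zip(a, b))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(hamming(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(vectors, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Toy repositories (hypothetical), described by the icons they display.
repos = [
    {"open_access", "license", "policy"},
    {"open_access", "license", "policy"},
    {"open_access", "license", "policy", "pid"},
    {"open_access", "license", "policy", "pid", "certificate"},
    {"open_access"},
    {"open_access", "license", "policy", "pid", "certificate"},
]
vectors = [to_binary(icons) for icons in repos]
clusters = agglomerate(vectors, 3)
print(clusters)
```

On the toy data the two fully compliant repositories end up together and the repository that is only open access stays on its own, which mirrors the idea of grouping repositories by operational level.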
The analysis method of the extracted data is as follows. First, this study identifies the
general state of repository operations across Asian repositories through policy conditions,
the adoption of technical standards and so on. Second, hierarchical cluster analysis is
performed based on the binary attributes extracted from the icon system. With these
attributes, this study determines the level of operation of the repositories in Asian countries
by forming clusters. All the analysis described above has been done through SPSS 32.
Third, the keywords of the subject field assigned to the repository are extracted, and a
weighted network analysis is performed. A total of 671 keywords are extracted, up to five
per repository, and a co-occurrence matrix is calculated for 37 keywords that appeared more
than five times each. This study calculated the Pearson correlation coefficient between word
pairs and applied it to the PFNET. In addition, parallel nearest neighbor clustering (PNNC)
was used to form keyword clusters, all of which were processed using WNET
(http://cafe.daum.net/wnets) as developed by Lee Jae Yoon. The PFNET was created by first
generating all weighted links and then removing the links that violate the triangle
inequality. In addition to effectively expressing the entire structure, as traditional methods
such as multidimensional scaling and cluster analysis do, a PFNET analysis is able to express
a detailed structure more clearly (Lee, 2006a). PNNC can effectively represent clusters on the
PFNET (Lee, 2006b).
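The keyword pipeline described above (co-occurrence counts, Pearson correlation between keyword profiles, then Pathfinder pruning) can be sketched in a few lines. This is an illustration under stated assumptions, not the WNET implementation: the keyword lists are invented, the diagonal of the co-occurrence matrix is taken to hold keyword frequencies, correlations are turned into distances as 1 − r, and the Pathfinder parameters are assumed to be r = ∞ and q = n − 1, which the paper does not specify. The frequency threshold and the PNNC clustering step are omitted.

```python
from itertools import combinations
from math import sqrt


def cooccurrence(docs, vocab):
    """Count, for each keyword pair, the repositories in which both appear."""
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    m = [[0] * n for _ in range(n)]
    for keywords in docs:
        present = sorted({index[w] for w in keywords if w in index})
        for i in present:
            m[i][i] += 1  # assumption: diagonal holds keyword frequency
        for i, j in combinations(present, 2):
            m[i][j] += 1
            m[j][i] += 1
    return m


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def pfnet(dist):
    """PFNET(r=inf, q=n-1): keep link (i, j) only if no other path has a smaller largest step."""
    n = len(dist)
    minimax = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via = max(minimax[i][k], minimax[k][j])
                if via < minimax[i][j]:
                    minimax[i][j] = via
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if dist[i][j] <= minimax[i][j]]


# Hypothetical subject keywords, one list per repository.
docs = [
    ["biology", "genetics", "medicine"],
    ["biology", "genetics"],
    ["physics", "earth science"],
    ["physics", "earth science", "geography"],
    ["biology", "medicine"],
]
vocab = sorted({w for keywords in docs for w in keywords})
matrix = cooccurrence(docs, vocab)
n = len(vocab)
dist = [[0.0 if i == j else 1.0 - pearson(matrix[i], matrix[j]) for j in range(n)]
        for i in range(n)]
links = pfnet(dist)
print([(vocab[i], vocab[j]) for i, j in links])
```

A direct link survives only when it is at least as short as the largest step on every alternative route, which is why the resulting network keeps the strongest associations and drops redundant ones.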
Table II. Types of repositories and types of data

                                          No.   (%)
Type
  Disciplinary                            117   78
  Disciplinary + institutional             16   11
  Institutional                            13   9.0
  Other                                     6   3.3
Data type
  Scientific and statistical data formats 101   16
  Images                                   81   13
  Standard office documents                81   13
  Structured graphics                      73   11
  Plain text                               63   10
  Raw data                                 62   10
  Structured text                          30   4.9
  Other                                    28   4.6
  Databases                                25   4.1
  Software applications                    22   3.6
  Audiovisual data                         18   2.9
  Archived data                            15   2.4
  Network-based data                       12   2.0
  Configuration data                        2   0.3
  Plain texts                               1   0.2
  Source code                               1   0.2
Sustainable Humanosphere and Japan Society for the Promotion of Science has received
WDS certification (Table III).
Fourth, most repositories did not adopt PIDs. However, when PIDs were adopted, DOI
was the most common at 19 (13 per cent). Most repositories run on unknown systems or on
systems developed in-house. Among the known systems, there are three
cases where DataVerse is used. As for the representative metadata type, few
institutions responded, but there were repositories using Dublin Core, DDI or DataCite.
DDI was developed by the ICPSR (Inter-university Consortium for Political and Social
Research) as a data standard for social science in 1995. It was adopted by Peking University
Open Research Data and DR-NTU (Data) of Nanyang Technological University (Table IV).
Fifth, when it comes to open access, 150 of the 152 repositories (99 per cent) are open access,
and nearly all repositories have licenses, including terms and conditions of use. CC
licenses were declared in 21 repositories and copyright in 62 repositories. In
addition, 80 per cent of the Asian countries’ repositories were found to disclose policies, such
as preservation and use policies, open access, information protection, ownership and
responsibility, data protection, codes of ethics and data sharing (Table V).
Table IV. PID, application software and metadata adoption status

                                          No.   (%)
PID
  None                                     92   61
  DOI                                      19   13
  Unanswered                               18   11
  Other                                    17   11
  HDL                                       3   2.0
  ARK                                       2   1.3
  CC                                        1   0.7
Software
  Unknown/other                            84   55
  None/unanswered                          59   39
  MySQL                                     4   2.6
  DataVerse                                 3   2.0
  DSpace                                    2   1.4
Metadata
  Unanswered                              102   67
  Unknown/other                            25   17
  DataCite                                  3   2.0
  DDI                                       3   2.0
  Dublin Core                               3   2.0
  FGDC/CSDGM                                3   2.0
  ISO 19115                                 3   2.0
  ISA-Tab                                   2   1.3
  Repository-developed metadata schemas     2   1.3
  ABCD                                      1   0.7
  Darwin Core                               1   0.7
  DIF                                       1   0.7
  EML                                       1   0.7
  MIBBI                                     1   0.7
  RDF Data Cube Vocabulary                  1   0.7
Table V. Open access, licensing status and policy disclosure

                                          No.   (%)
Open
  Yes                                     150   99
  No                                        2   1.3
License
  Copyright                                62   41
  Other                                    51   34
  CC                                       21   14
  Unanswered                               18   12
Policy
  Yes                                     122   80
  No                                       30   19
Korea and Taiwan each have one repository included. By contrast, there are 25 repositories
that belong to Cluster 4, which are not yet certified but have PIDs, with Japan taking the
largest share, followed by China and India.
Finally, Cluster 1 is a group of open access-based repositories with licenses and
policies but no PIDs and an uncertified status. Cluster 1, which has the largest number of
repositories at 70 per cent, includes 24 of 37 in China and 27 of 31 in India. It also includes
37 of 57 in Japan, six of nine in Taiwan and five of six in Korea. In addition, the repositories of
Indonesia, North Korea, Pakistan, Turkey, the Philippines and other Asian countries are
classified in Cluster 1.
To sum up, most of Asia’s data repositories disclose their policies and licenses and run
on an open-access basis. However, it is not common for them to assign permanent identifiers
or to be internationally certified. In addition, it was determined that mostly unknown
systems are used, and only a few institutions have adopted known metadata standards, such as
DataCite, Dublin Core and DDI. About 70 per cent of the Asian repositories operate at an
ordinary level, but there are excellent repositories in China and Japan that
meet all indicators.
The result of the PFNET analysis was visualized with NodeXL (Figure 1). The clusters on the
map indicate that the topics are relatively clearly distinguished. Therefore, it can be inferred
that the RDR is operated in the form of a subject repository that specializes in a specific subject
rather than covering multidisciplinary subjects. The life science (c2) cluster has biochemistry,
genetic engineering and medicine as subdomains, and animal science (c6) and botanical science
(c7) are adjacent to life science. Therefore, it is probable that the sub-areas of life sciences are
closely related to each other and are managed in a comprehensive manner.
The general area of natural science (centered on physics, geography and earth science) is
located on the right side in the map in Figure 1. This area also occupies an important part of
the map and shows a correlation. On the other hand, humanities and social sciences (c4) are
spreading at the top of the map, suggesting that they are highly likely to operate
independently of the life and natural sciences.
Figure 1. Subject clusters in Asian country data repositories
Conclusion
Research results in developing countries, especially non-English-speaking countries, tend
not to be smoothly circulated in the international community due to the immaturity of the
open-access culture, as well as linguistic and technical problems. The problem of missing
Asian research results can be an impediment to the balanced development of international
society through the flow of knowledge. Therefore, this study investigated Asian countries’
research data management and distribution infrastructure problems under global open
science.
As a result of this study, it was found, first, that repositories from 13 Asian countries were
registered in re3data, with many of them in Japan, China
and India. Most of them were in the form of subject repositories rather than institutional
units, and there were many cases in which they participate in the form of a consortium with
Western developed countries. Second, it was determined that mostly unknown systems are
used. Only a small number of repositories assigned permanent identifiers or used known
metadata standards. Third, the most common type of Asian country repository was
one where licenses or policies were open, but which assigned no permanent identifiers and
operated in an uncertified state. This type accounted for 70 per cent of the total. On the other
hand, it was found that there are three repositories in Japan and China that show an
excellent level, meeting all conditions. Fourth, in analyzing the subject topics of the Asian
countries’ repositories, eight clusters were created, and life sciences and natural sciences
were found to be mainstream. This seems to follow the global tendency in which life science
has overwhelming influence.
Meanwhile, the precedent research related to RDRs has remained at the statistical level
of analysis. By contrast, this study revealed the overall level of Asian countries and grasped
the gaps between countries by clustering the data repositories using various indicators.
Interest in data repositories has begun to emerge in developed countries of the West, after
policies to deposit research data in trustworthy repositories were launched. However, there
are not many trustworthy research repositories in Asian countries where open-access policy
is less mature. This study found that existing Asian RDRs do not meet the level of operation
required by the international community. As these findings have not been reported
in precedent studies, this study has originality.
To keep pace with international society based on open science and to promote balanced
development through global knowledge exchange, it will be necessary to develop reliable
research data repositories in Asian countries. Although this study analyzed data repositories of
Asian countries based on re3data, future research will also be needed to analyze the form, level
and possibility of the recycling of actual data sets archived in repositories. While it is important
to evaluate the reliability and sustainability of the data repository, it is also important to
evaluate the reliability of the archived data sets within it. Therefore, in a subsequent study,
in-depth analysis should be performed on the archived data sets.
References
DataCite (2012), “Business models principles”, available at: www.datacite.org/documents/Business_
Models_Principles_v1.0.pdf (accessed 2 December 2018).
Kim, S. (2018), “Global data repository status and analysis: based on Korea, China and Japan data in
re3data.org”, International Journal of Knowledge Content Development and Technology, Vol. 8
No. 1, pp. 79-89.
Kim, S. and Choi, M. (2017), “Functional requirements for research data repositories”, International
Journal of Knowledge Content Development and Technology, Vol. 27 No. 2, pp. 41-51.
Kindling, M., Pampel, H., Sandt, S., Rücknagel, J., Vierkant, P., Kloska, G., Witt, M., et al. (2017), “The
landscape of research data repositories in 2015: a re3data analysis”, D-Lib Magazine, Vol. 23
Nos 3/4.
Lee, J.Y. (2006a), “A study on the network generation methods for examining the intellectual structure re3data
of knowledge domains”, Journal of the Korean Society for Library and Information Science,
Vol. 40 No. 2, pp. 333-355.
Lee, J.Y. (2006b), “A novel clustering method for examining and analyzing the intellectual structure of a
scholarly field”, Korea Society for Information Management, Vol. 23 No. 4, pp. 215-231.
Organization for Economic Co-operation and Development (OECD) (2013), “OECD principles and
guidelines for access to research data from public funding”, available at: www.oecd.org/sti/sci-
tech/38500813.pdf (accessed 9 September 2018).
Pampel, H., Vierkant, P., Scholze, F., Bertelmann, R., Kindling, M., Klump, J., Goebelbecker, H., et al.
(2013), “Making research data repositories visible: the re3data.org registry”, PLoS One, Vol. 8
No. 11, p. e78080.
Science and Technology Policy Institute (STEPI) (2017), “Expansion of open science policy and
implications”, STEPI Insight, Vol. 216, pp. 2-38.
Zhang, S., Huang, G. and Geng, Q. (2017), “Research on UK scientific data publishing platforms based
on Re3data”, Digital Library Forum, Vol. 6, pp. 16-24.
Corresponding author
Jane Cho can be contacted at: chojane123@naver.com