
The current issue and full text archive of this journal is available on Emerald Insight at:

https://www.emerald.com/insight/2514-9288.htm

Data science and its relationship to library and information science: a content analysis

Sirje Virkus
School of Digital Technologies, Tallinn University, Tallinn, Estonia, and
Emmanouel Garoufallou
Department of Library Science, Archives and Information Systems,
School of Social Sciences, International Hellenic University, Thessaloniki, Greece and
Deltos Group, Thessaloniki, Greece

Received 27 July 2020
Revised 28 August 2020
Accepted 28 August 2020

Abstract
Purpose – The purpose of this paper is to present the results of a study exploring the emerging field of data
science from the library and information science (LIS) perspective.
Design/methodology/approach – A content analysis was made of research publications on data science
recorded in the Web of Science database to identify the main themes discussed in the publications from
the LIS perspective.
Findings – A content analysis of 80 publications is presented. The articles fell into six broad
categories: data science education and training; knowledge and skills of the data professional; the role of
libraries and librarians in the data science movement; tools, techniques and applications of data science; data
science from the knowledge management perspective; and data science from the perspective of health sciences.
The category of tools, techniques and applications of data science was most addressed by the authors, followed
by data science from the perspective of health sciences, data science education and training and knowledge and
skills of the data professional. However, several publications fell into several categories because these topics
were closely related.
Research limitations/implications – Only publications recorded in the Web of Science database and with
the term “data science” in the topic area were analyzed. Therefore, several relevant studies are not discussed in
this paper that either were related to other keywords such as “e-science”, “e-research”, “data service”, “data
curation”, “research data management” or “scientific data management” or were not present in the Web of
Science database.
Originality/value – The paper provides the first exploration by content analysis of the field of data science
from the perspective of LIS.
Keywords Education, Content analysis, Literature review, Data science, Information science, Library science
Paper type Research paper

1. Introduction
Data science, having emerged in response to the increased amount of data, has received
considerable attention in recent years. For example, by April 2019 the Web of Science database
had recorded 2,350 publications in the topic area “data science” for the period 1980–2019.
Of these, 44.8% came from the subject area of computer science, followed by
engineering (18.2%), mathematics (7.4%), science technology and other topics (5.6%),
business economics (4.6%), education and educational research (3.6%), information science
and library science (3.4%), physics (2.9%), telecommunications (2.9%), medical informatics
(2.7%), materials science (2.6%) and operations research and management science (2.6%)
(Virkus and Garoufallou, 2019a). At 3.4%, the library and information science (LIS)
contribution to data science in the period 1980–2019 was limited.
This paper presents the results of a study that explored the field of data science from the
LIS perspective using content analysis. The structure of this paper is as follows: section 2
describes the research methodology, section 3 presents the results of the content analysis of
data science from the LIS perspective and section 4 draws conclusions.

Data Technologies and Applications, Vol. 54 No. 5, 2020, pp. 643-663
© Emerald Publishing Limited, 2514-9288
DOI 10.1108/DTA-07-2020-0167

2. Methodology
This paper presents research results that are part of a study that explored data
science from the LIS perspective. The following research questions were proposed: (1) What
are the main tendencies in publication years, document types, countries of origin, source
titles, authors of publications, affiliations of the article authors and the most cited articles
related to data science in the field of LIS? (2) What are the main themes discussed in the
publications from the LIS perspective?
In the first stage, the bibliographic analysis was made on the basis of papers published in
the Web of Science database. Searches were carried out in the database by topic in April 2019
using the term “data science”. The search strategy discovered 80 publications published between
1980 and 2019. A statistical descriptive analysis of these papers provided answers to the first
research question and was presented by Virkus and Garoufallou (2019a). The methodology
and research approach were also tested in another paper by Virkus and Garoufallou (2019b),
which dealt with data science from the perspective of computer science.
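
The figure of 80 publications is consistent with the share reported in the introduction: 3.4% of the 2,350 Web of Science records is roughly 80. The following is a minimal Python sketch of that arithmetic. It assumes the published percentages are rounded to one decimal place, so the implied counts are approximate; the variable and function names are illustrative and not taken from the study.

```python
# Approximate publication counts implied by the shares reported in the
# introduction (Virkus and Garoufallou, 2019a). Only the total and the
# percentages come from the text; all names here are illustrative.
TOTAL_PUBLICATIONS = 2350

shares_percent = {
    "computer science": 44.8,
    "engineering": 18.2,
    "mathematics": 7.4,
    "science technology and other topics": 5.6,
    "business economics": 4.6,
    "education and educational research": 3.6,
    "information science and library science": 3.4,
    "physics": 2.9,
    "telecommunications": 2.9,
    "medical informatics": 2.7,
    "materials science": 2.6,
    "operations research and management science": 2.6,
}


def implied_count(share_percent, total=TOTAL_PUBLICATIONS):
    """Return the nearest whole number of publications for a percentage share."""
    return round(total * share_percent / 100)


implied_counts = {area: implied_count(p) for area, p in shares_percent.items()}
lis_count = implied_counts["information science and library science"]
print(lis_count)  # 80, matching the number of publications analyzed
```

Because the shares are rounded, the other implied counts should be read as estimates rather than exact record counts.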
In the second stage, a content analysis of the 80 publications was made to answer the second
research question: What are the main themes discussed in the publications from the LIS
perspective? This paper provides an answer to the second research question.
The research has two limitations: (1) only publications recorded in the Web of Science
database were analyzed and (2) of these, only publications with the term “data science” in the
topic area were included. Therefore, several relevant studies are not discussed in this
paper that either were related to other keywords such as “e-science”, “e-research”, “data
service”, “data curation”, “research data management” or “scientific data management” or
were not present in the Web of Science database.

3. Data science from the library and information science perspective


In recent years, the data science literature has increased significantly. Researchers have
discussed the concept and nature of data science, its origins, main areas, related concepts,
relations to other disciplines, knowledge and skills required from data scientists, business
value as well as potentials and limitations and data science-related activities. Virkus and
Garoufallou (2019a) provided a bibliographic analysis of data science publications reflected
in the Web of Science database in the research area of information science and library science
and a literature review. In the following sections of this paper, a content analysis of the
publications in the Web of Science database is provided. The publications are discussed in
chronological order in each thematic area.

3.1 Content analysis of the papers


In total, 80 papers in the research area of information science and library science of the Web of
Science database were examined. This section summarises the existing work published
between 2005 and April 2019. The papers are divided into six main categories: (1) data science
education and training; (2) knowledge and skills of the data professional; (3) the role of
libraries and librarians in the data science movement; (4) tools, techniques and applications of
data science; (5) data science from the knowledge management (KM) perspective; (6) data
science from the perspective of health sciences. There were a number of other topics, but these
were only addressed in a few papers.
3.1.1 Data science education and training. Data science education and training is one of the
main topics discussed in the information science and library science area of the Web of
Science database. “Facilitating the effective use of Earth science data in education through
digital libraries” by Ledley et al. (2005), presented in the 5th ACM/IEEE-CS Joint Conference
on Digital Libraries (JCDL’05), was the first paper recorded in the Web of Science database in
this research area. It shared experiences of a workshop, hosted by the Digital Library for
Earth System Education (DLESE) Data Services group in May 2004, which aimed to bridge
the gap between the scientific and educational uses of data and involved participants from
critical areas of expertise such as earth science data providers, analysis tool developers,
scientists, curriculum developers and educators. Since the first article in 2005, publications on
various aspects of data science education and training appeared throughout the period
2005–2019.
For example, Stanton et al. (2012) discussed the interdisciplinary nature of data science
education. They argued that data science education should include a deep understanding of
how project data are collected, pre-processed and transformed, and expertise across three main
areas (curation, analytics and cyber-infrastructure) with deep knowledge in at least one of
these areas, as well as domain knowledge.
Si et al. (2013) analyzed courses, degrees and programmes related to scientific data
curation provided by iSchools globally. They found that the curricula of iSchools covered
basic knowledge and methods of data curation, but lacked content such as the usage of data
curation tools and the approaches for user training.
Erdmann (2015) reported the experiences of an experimental Harvard Library course
aimed at training librarians for the growing data needs of their communities. The course
introduced the research data lifecycle and provided hands-on experience with the latest tools
for extracting, wrangling, storing, analyzing and visualizing data.
Using content analysis, Tang and Sae-Lim (2016) explored 30 randomly selected data
science graduate programmes in eight disciplines, including information science in the USA.
The results revealed significant gaps in current data science education. Most data science
courses covered the basic level of analytical skills, but upper-level skills were inadequately
addressed. While core courses provided information skills, communication and visualization
skills, their elective courses did not address such skills. The course offering on mathematics/
statistics was rather weak in iSchools.
Baskarada and Koronios (2017) interviewed nine managers from nine Australian
government agencies with relatively mature data science functions and identified six key
roles for an effective data science team: domain expert, data engineer, statistician, computer
scientist, communicator and team leader. Primary and secondary skills for each of the roles
were identified, and the resulting framework was then used to evaluate three data science
master-level degrees offered by three Australian universities. The analysis indicated that
there was a limited opportunity for specialization in the skill areas identified in the
framework. The authors were concerned that without deep expertise in any of the roles
identified in their framework, such graduates may not be able to effectively contribute to
multidisciplinary data science teams (Baskarada and Koronios, 2017, p. 72).
Cervone (2017) analyzed selected English language KM programmes at universities in the USA,
Europe, Australia and Asia to understand the scope and nature of these programmes. He found
that the concept of KM as a distinct programme of study appears to be stable but the number of
programmes is declining. He also found that these programmes were located in a variety of
study fields and that coverage of the KM field is becoming increasingly diverse in its approach.
It appears that the KM programmes are moving toward transformation or integration with
allied fields and also include components of data science.
Song and Zhu (2017) proposed a layered Data Science Education Framework (DSEF) for
iSchools. The building blocks included the three pillars of data science (people, technology
and data), computational thinking, data-driven paradigms and the data science lifecycle. They
suggested that data science courses built on this framework should be user-based, tool-based
and application-based. They identified three types of data scientists with different focus:
(1) those with hardcore programming backgrounds who develop and implement new
algorithms, tools and systems and make systems effective and efficient; (2) those with in-depth
statistical knowledge who analyze data with advanced statistical techniques; and (3) those
with extensive business backgrounds who validate business models from data with a focus on
business analytics (Song and Zhu, 2017, p. 7). The user-based data science education implies
the training of a data scientist who solves data science problems in some application domains
or who can use data science products effectively, rather than those who develop data science
products. Song and
Zhu (2017, pp. 7-8) note that the iSchool curriculum has strong advantages in user-based data
science education by training students who understand the importance of requirement
modelling, know the roles of metadata and utilize them, design and develop systems with
human-centred usability in mind, consider security and privacy of data at all the stages of the
data science lifecycle, validate analysis outcomes, perform appropriate storytelling for
stakeholders, manage projects with keen insights on data science lifecycles, protect data and
generate insights, utilize data and outcomes ethically and know how to set up strategies for
data archiving and curation. The tool-based data science education emphasizes the importance
of automated tools and utilization of available libraries and therefore students learn both
high-level analytical languages and data science tools. Learning a programming language
helps students think logically and computationally. This approach gives students experience
in developing new applications with existing libraries and helps them focus on problem-
solving rather than low-level coding. In addition, students also learn one or more major data
science tools (Song and Zhu, 2017, p. 9). The key points of application-based data science
education are threefold, developing (1) project-based education, (2) the ability to work with
domain experts and (3) the expertise in one application domain. Students experience a series
of comprehensive case studies covering each step of the data science lifecycle, which gives
them a big picture of a data science project, allows them to think about a domain they want to
work on further and helps them choose a specific area they want to develop their own
expertise. Then, there should be a capstone course/module that students work on as a team to
address real-world projects (Song and Zhu, 2017, p. 10).
Poulova et al. (2018) described experiences in developing the data science study
programme at the University of Hradec Kralove of the Czech Republic. Ortiz-Repiso et al.
(2018) made a cross-institutional analysis of data-related curricula across 65 iSchools. The
results showed that a majority of iSchools offered some form of data-related education,
particularly at the master’s level, and that approximately 15% of their formal degree
offerings had a data focus. iSchools had a greater emphasis on data science and big data
analytics (BDA), but few programmes were providing focused curricula in the area of digital
curation. Wang (2018) argued that LIS schools should integrate data science and information
science and develop organizational ambidexterity, believing that information science can
make unique contributions to data science research.
Thus, 11 papers discussed data science education and training. Several studies explored
courses, degrees and programmes related to data science (Si et al., 2013; Tang and Sae-Lim,
2016; Baskarada and Koronios, 2017; Cervone, 2017; Ortiz-Repiso et al., 2018). Some papers
introduced data science initiatives (Ledley et al., 2005; Erdmann, 2015; Poulova et al., 2018) or
discussed conceptual issues related to data science education (Stanton et al., 2012; Wang,
2018). Tang and Sae-Lim (2016) revealed significant gaps in current data science education in
the USA, while Si et al. (2013) found that the curricula of iSchools covered basic knowledge
and methods of data curation, but lacked some required content. Ortiz-Repiso et al. (2018) also
found that only a few programmes were providing focused curricula in the area of digital
curation in iSchools. Baskarada and Koronios (2017) were critical that the selected data
science master’s degrees offered by Australian universities only produced quasi-unicorns,
and there was a limited opportunity for specialization. Song and Zhu (2017) proposed a
layered DSEF for iSchools. It was believed that information science could make a valuable
contribution to data science (Wang, 2018), and it would be beneficial to view data science from
a multidisciplinary team perspective (Baskarada and Koronios, 2017).
3.1.2 Knowledge and skills of the data professional. The knowledge and skills of data
professionals are closely linked to the education and training of data scientists. For example,
Stanton et al. (2012) indicated that data scientists must have a deep understanding of how
project data are collected, pre-processed and transformed, and they must possess expertise
across three main areas (curation, analytics and cyber-infrastructure) with deep knowledge in
at least one of these areas, as well as domain knowledge. They defined data scientists as
information professionals who contribute to the collection, cleaning, transformation, analysis,
visualization and curation of large, heterogeneous data sets, which are information science-
driven tasks and key to the data science domain. They were concerned that several
conceptions of data science focussed primarily on analytical methods.
Si et al. (2013) identified the primary duties of data specialists (described in section 3.1.3).
They found that libraries value teamwork, communication and interpersonal ability and a
good use of data curation tools as the core competences for scientific data specialists.
Preference was given to candidates with a second advanced degree who understood
libraries, had proven knowledge of metadata standards and paid attention to detail.
Antell et al. (2014) explored science librarians’ awareness of and involvement in
institutional repositories, data repositories and data management support services at their
institutions and the skills that science librarians perceived necessary for data management
work. The results of the online survey, based on 175 responses from academic libraries
affiliated with the Association of Research Libraries (ARL), showed that the most frequently
cited skills were knowledge of the data lifecycle, subject-specific knowledge or skills,
communication, networking and reference skills, followed by metadata skills, software or
computer skills and knowledge of the research process. However, the results revealed
uncertainty about the skills that will be required, but optimism about applying traditional
librarianship skills to this emerging new field of academic librarianship. Traditional
reference skills (the ability to liaise, refer, consult and teach) were among the skills that survey
respondents cited most frequently as being necessary for science librarians who plan to assist
researchers with data management.
Carter and Sholler (2016) conducted interviews with 18 data analysts from various
industries with varying levels of experience in extracting, analyzing and using data. They
evaluated these interviews in light of the hype and criticisms surrounding data science in the
popular discourse. They found that although the data analysts were sensitive to both the
allure and the potential pitfalls of data science, their motivations and evaluations of their
work were more nuanced. The most common tasks of data scientists included data gathering,
analysis and presentation of results to peers or managers containing visualizations from the
analysis they conducted. Because the interviewees worked in a variety of industries, tasks
within each of these stages varied. For example, some analysts used proprietary data and
technologies, while others worked with public data and open-source technologies. Some
worked with others on a day-to-day basis, while others communicated less frequently. The
study emphasized data scientists’ need for certain personality characteristics such as
creativity and curiosity.
Baskarada and Koronios (2017), who evaluated data science master’s degrees offered by
Australian universities (described in section 3.1.1), identified six key roles for an effective data
science team, together with primary and secondary skills for these roles. (1) Domain experts, in addition
to domain expertise, require some statistical skills as secondary skills to facilitate
identification/generation of relevant questions and hypotheses, as well as interpretation of
results. (2) Data engineers require in addition to data preparation skills (e.g. data extraction,
cleaning, enrichment, transformation) some domain expertise for data preparation as
secondary skills. (3) Statisticians form a bridge between domain experts, data engineers and
computer scientists. In addition to experimental design and hypothesis testing skills, they
require some domain expertise, data preparation expertise and solid understanding of skills
that are at the intersection of statistics and computer science. (4) Computer scientists require
proficiency in various tools and technologies and relevant programming languages like R
and Python, as well as cluster and cloud computing and skills in text analytics and natural
language processing. They require some data preparation skills and reasonably advanced
statistical skills. (5) Communicators form a bridge between data science teams and relevant
decision makers. They require significant domain expertise and some statistical skills to be
able to present analytical findings in a form that is visually appealing, easy to understand and
ultimately convincing. (6) Team leaders require project management expertise and some
understanding of all the other roles to bring everyone together, manage resources, tasks and
deliverables.
Da Sylva (2017) explored how information professionals prepare themselves to manage
different types of data. She suggested that three components were important: a clear
understanding of the different types of data, an initiation to the resources required to process
each type of data and an understanding of the impact that each type will have on information
science as a discipline and on the practice of information professionals.
Chen and Zhang (2017) analyzed 70 job advertisements from five academic and professional
online job lists and indicated that most job positions required that the successful applicant
should be able to serve faculty and students to collect, manage and analyze research data with
essential qualifications to carry out those tasks.
Kennan (2017) interviewed 36 practicing data professionals and their employers about current
knowledge and skills requirements in Australia. In universities and scientific research
organizations, the required knowledge and skills were in the area of data management and
curation, and in business and government organisations, in the area of data science and
management. The participants reported the importance of high-level communication and
personal learning skills, curiosity, flexibility and comfort with change. It was evident that
data work was often conducted by teams of differently qualified professionals rather than by
individuals or groups of similarly qualified people.
Costa and Santos (2017) proposed a conceptual model for the professional profile of a
data scientist and evaluated the representativeness of this profile in two commonly
recognized competences/skills frameworks in the field of information and communications
technology (ICT), namely, the European e-Competence (e-CF) framework and the Skills
Framework for the Information Age (SFIA). They found that a significant part of the
knowledge base and skills set of data scientists is related to ICT competencies/skills,
including programming, machine learning and databases. The data scientist professional
profile has an adequate representativeness in the e-CF and SFIA frameworks, but it is mainly
seen as a multi-disciplinary profile, combining computer science, statistics and mathematics.
Ghasemaghaei et al. (2018) developed and validated the concept of data analytics
competency as a multi-dimensional formative index with five dimensions (i.e. data quality,
bigness of data, analytical skills, domain knowledge and tools sophistication) and examined
empirically its impact on firm decision-making performance (i.e. decision quality and
efficiency). The
findings based on an empirical analysis of survey data from 151 information technology (IT)
managers and data analysts demonstrated a large, significant and positive relationship
between data analytics competency and firm decision-making performance. The results
revealed that all dimensions of data analytics competency significantly improved decision
quality. In addition, all dimensions, except bigness of data, significantly increased decision
efficiency.
Ten papers focused on the knowledge and skills of data professionals. Two of them
also addressed the role of data science education and training and were already described in
section 3.1.1 (Si et al., 2013; Baskarada and Koronios, 2017). Most of the papers in this category
were research papers analyzing library-released job advertisements and courses, degrees and
programmes (Si et al., 2013; Baskarada and Koronios, 2017; Chen and Zhang, 2017), and
conducting surveys (Antell et al., 2014; Ghasemaghaei et al., 2018), interviews (Carter and
Sholler, 2016; Baskarada and Koronios, 2017; Kennan, 2017) and document analysis (Costa and
Santos, 2017) to identify knowledge and skills of data professionals. Domain expertise, data
preparation, management and curation competencies, traditional library work competencies,
statistical skills, ICT and programming skills and general skills (e.g. teamwork,
communication, project management, interpersonal ability, personal learning skills,
curiosity, flexibility, comfort with change) were emphasized as necessary competencies for
data professionals (Si et al., 2013; Antell et al., 2014; Baskarada and Koronios, 2017; Chen and
Zhang, 2017; Costa and Santos, 2017; Da Sylva, 2017; Kennan, 2017; Ghasemaghaei et al.,
2018). The professional profile of data scientists was mainly seen as a multi-disciplinary
profile, combining computer science, statistics, mathematics (Costa and Santos, 2017) and LIS
(Si et al., 2013; Antell et al., 2014; Chen and Zhang, 2017). However, many organizations still
lack a clear understanding of the required roles and competencies of data professionals.
3.1.3 The role of libraries and librarians in the data science movement. Si et al. (2013)
analyzed information about library-released job advertisements for scientific data specialists
(also described in Sections 3.1.1 and 3.1.2). The results showed that libraries are actively
participating in the curation of scientific data. However, the orientation of the role of the
scientific data specialist was not explicit and a consensus on its scope had not been reached.
The main roles of scientific data specialists included offering consultation and reference
services for research and scientific data curation, providing instruction and training to users,
including helping them understand the significance of scientific data curation and master the
usage of various tools for data processing, analysis and statistics. As scientific data
specialists have to work with research teams and participate in classroom teaching, the
scientific data curation service has become a newly embedded information service. A number
of employers believed that employees should serve as a liaison between academic units and
the library to solicit users’ data curation needs and collect their feedback about the
service. Assisting researchers with the creation of working files such as data management
plans, data quality assurance plans and data management implementation plans was also among
the primary duties of scientific data specialists. Furthermore, metadata design and creation
was a major responsibility of the position. In addition, participating in the development of
policies and procedures for scientific data curation, giving guidance in research data
collection, analysis and storage, joining organisations within the field of data curation and
engaging in their launching initiatives, all served as the key duties of scientific data
specialists. Other duties that were mentioned were: delivering scientific data navigation
services; constructing integrated retrieval systems; carrying out research on related issues of
data curation, such as data mining, digital publishing and visual data analysis; and
promoting open access and sharing of scientific data to facilitate scholarly communication.
The authors conclude that those duties cover the whole lifecycle of scientific data curation,
namely, data collection, data processing, data accessing and utilization and data
preservation. Thus, it can be concluded that libraries and librarians have an important
role in the scientific data curation process.
Antell et al. (2014) (described in section 3.1.2) also investigated science librarians’ roles and
responsibilities related to data management; the results revealed uncertainty about the roles
of librarians, libraries and other campus entities. However, some librarians expressed the
opinion that data management duties are a natural extension of the science librarian’s job.
Borgman et al. (2015) addressed the role of digital libraries in knowledge infrastructures
for science, presenting evidence from long-term studies of four research sites. They
highlighted the need for expertise in digital libraries, data science and data stewardship
throughout all four sites. Examples were presented of the challenges in designing digital
libraries and knowledge infrastructures to manage and steward research data.
Maxwell et al. (2018) believe that the emergence of data-driven research and discovery
may be one of the greatest strategic opportunities for academic libraries, in particular in data
curation and data analysis. They reported the results of a survey, conducted at the University
of Florida, which indicated a high demand for training in analytical tools and technologies.
Koltay (2019) identified tasks and roles that academic libraries have to fulfil to respond to
data-intensive research. He focuses on theoretical considerations and practical experiences
related to research data management, understood as a major complex of services offered to
scholars involved in data-intensive research. He notes that academic libraries show varied
levels of readiness, preparedness or maturity to take responsibility for these services;
national developments in this field also vary. Researchers need support in planning,
organizing, securing, documenting and sharing data sets for deposit, preserving them on a
short- and long-term basis and on copyright, licensing and intellectual property issues. To
address all these issues, libraries must engage in high levels of interaction with researchers
and cooperate with other support service providers. Koltay notes that although there are
similarities between duties and skills of data-related professions (e.g. data librarians, data
scientists, records professionals), data librarians are not data scientists; their work
environments, culture and scope of duties differ. However, data science provides new
methods and practices for data librarianship, without requiring data librarians to become
programmers, statisticians or database managers (Koltay, 2019, p. 78).
Five papers in this category explored the role of libraries and librarians in the data science
movement. Data management work in research libraries seems to be in its emergent phase.
Although the authors of these papers saw strategic opportunities and an important role for
academic libraries, there remains much uncertainty about the roles of librarians and libraries
in data management.
3.1.4 Tools, techniques and applications of data science. Several authors discussed tools,
techniques and applications of data science. Park and Leydesdorff (2013) used semantic
network analysis to examine international co-authorship. They provided an empirical
analysis of semantic patterns of paper titles. The results showed that internationally co-
authored papers tend to focus on primary technologies, particularly in terms of programming
and related database issues, and a combination of words and locations can provide a richer
representation of an emerging field than the sum of the two separate representations.
Mitchell (2015) explored which aspects of reproducibility as a research practice are
applicable to technical service processes in data science. He gave an overview of reproducibility
principles and presented a framework for evaluating levels of reproducibility.
Kocheturov and Pardalos (2016) described the history of massive networks and their place in
modern life and discussed open problems related to them in data science. They considered
how real-life massive data sets could be represented in terms of networks, describing some
examples and summarizing properties of such networks. They also discussed cases of
modelling real-life massive networks and gave examples of how to optimize in massive
networks and in which areas these techniques can be applied. Biswas (2016) introduced
useful data structures for programmers working with big data. Gollub et al. (2016)
demonstrated the potential of topical sequence profiling as an effective data science
technology. Larson and Chang (2016) explored the application of agile methodologies and
principles to business intelligence delivery and new trends such as fast analytics and data
science as part of business intelligence. Leung et al. (2016) presented a big data science
solution for social computing and social network analytics so as to provide services and
support to big data mining of interesting patterns from big social networks that are stored in
key-value databases. Newman et al. (2016) proposed a business data science (BDS) model that
allows different types of functions, processes and roles to work together collaboratively for
efficiency and performance improvements. Examples were provided and future directions
were discussed to ensure that business intelligence, security, analytics and research
contributions to BDS could be achieved. Yousafzai et al. (2016) analyzed directory-based
incentive management services for ad hoc mobile clouds and proposed a directory-based
architecture that keeps track of the retribution and reward valuations for devices even after
they move from one ad hoc environment to another.
Umachandran and Ferdinand-James (2017) discussed the use of big data applications in
agriculture, manufacturing and education, using technological tools such as Hadoop, Hive,
Sqoop and MongoDB. Qasim et al. (2017) analyzed the production and consumption of
scientific knowledge across the regions in the field of sustainable and renewable energy using
publications and citations data indexed in Scopus. The results showed that research topics
produced by the USA are consumed in different international contexts, and the use of
advanced data mining and computing methods for deriving critical insights for the use of
scientific knowledge is an action towards the global knowledge society vision. Lorentzen and
Nolin (2017) explored problems of sampling and completeness through the specific example
of conversations in Twitter. They found that different network analysis techniques and
filtering options give different results with regard to prominent users. Xia et al. (2017) tested
the hypothesis that the value of data for scientific investigators, in terms of the impact of the
publications based on the data, decreases over time through a mixed linear effects model
using approximately 1,200 publications between 2007 and 2013 that used data sets from the
database of genotypes and phenotypes, a data-sharing initiative of the National Institutes of
Health (NIH). The analysis showed that the impact factors for publications based on database
of genotypes and phenotypes data sets depreciate in a statistically significant manner.
However, they further discovered that the depreciation rate was slow, only 10% per year, on
average.
Almugbel et al. (2018) explored the use of interactive software notebooks to document and
distribute research. They provided a user-friendly tool, BiocImageBuilder, that allows users
to easily distribute their bioinformatics protocols through interactive notebooks uploaded to
either a GitHub repository or a private server. They presented four different interactive
notebooks that can be used to disseminate a wide range of bioinformatics analyses. Barbuti
et al. (2018) described the application of data science to studies in data humanities; the
algorithm was applied to find new research hypotheses through the discovery of patterns
directly inferred from large digital libraries. Estiri et al. (2018) provided an open-source,
interoperable and scalable data quality assessment tool for evaluation and visualization of
completeness and conformance in electronic health record (EHR) data repositories. They
described the tool’s design and architecture and gave an overview of its outputs using a
sample data set of 200,000 randomly selected patient records. Guo et al. (2018) developed a
deep learning model to integrate geographical and social influences for personalized point-of-
interest (POI) recommendation tasks. The model contributes to the effective usage of data
science and analytics for social recommender system design. The results can be used to
improve the quality of personalized POI recommendation services for websites and
applications. Alluqmani and Shamir (2018) used automatic text analysis to analyze the
writing styles in computer science, mathematics, physics and astrophysics. More than 9,000
scientific papers published between 2000 and 2016 were analyzed for each discipline. The
study showed statistically significant differences between the different disciplines such as
use of acronyms, sentence length and word length. The findings also showed changes in
writing styles in specific disciplines over time. Zhou et al. (2018) presented research that
developed a probabilistic model to extract linear terrain features from high-resolution digital
elevation models (DEMs). The proposed model takes full advantage of spatio-contextual
information to characterize terrain changes. Through a series of experiments, they
demonstrated that the proposed approach outperforms existing techniques, including
thresholding, stream/drainage network analysis, visual descriptor detection, object-based
image analysis and edge detection.
Halim and Khan (2019) presented a data science-based framework that evaluates journals
based on their key bibliometric indicators and presents an automated approach to categorize
them. Cho (2019) performed co-occurrence analysis on keywords assigned to research data in
the field of LIS, which were archived in the Figshare repository, to identify which research
data are actively produced and shared in the field of LIS. Four major domains (open access,
scholarly communication, data science and informatics) and 15 sub-domains were created.
The keywords with the highest global influence were: open access, scholarly communication
and altmetrics.
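The general idea behind keyword co-occurrence analysis can be illustrated with a minimal sketch. It is not the procedure used by Cho (2019); the keyword lists and the function name `cooccurrence_counts` are hypothetical and serve only to show how pairs of keywords assigned to the same record may be counted.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists, one per research data record
# (illustrative only; not taken from Cho, 2019).
records = [
    ["open access", "scholarly communication", "altmetrics"],
    ["open access", "altmetrics"],
    ["data science", "informatics", "open access"],
]


def cooccurrence_counts(keyword_lists):
    """Count how often each unordered keyword pair shares a record."""
    counts = Counter()
    for keywords in keyword_lists:
        # Deduplicate and sort so that each pair is counted once per
        # record and is always stored in the same order.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return counts


pair_counts = cooccurrence_counts(records)
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

In practice, such pair counts would typically be used as edge weights in a keyword network, from which clusters (domains) and influential keywords can then be derived.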
Altogether, 21 papers discussed various tools, techniques and applications of data science.
This category had the most papers; it shows that there are numerous data science tools,
techniques and applications that help obtain value from data, study data and its patterns and
generate outcomes from it. The papers also demonstrated that various tools, techniques and
applications of data science are applicable in a wide range of fields.
3.1.5 Data science from the knowledge management perspective. Several papers explored
data science from the KM perspective. For example, based on a literature review and analysis
of KM and data analysis systems implemented by companies, Intezari and Gressel (2017)
provided a conceptual framework about how KM systems can incorporate big data into
strategic decisions. They identified four main types of decision-making that depend on
whether a decision and the underlying data are structured (SD) or unstructured (UD): SD-SD,
SD-UD, UD-SD and UD-UD. They argued that existing KM systems need to be upgraded to
advanced KM systems – a particular type of KM system that can help an organization to
integrate big data into its knowledge and knowledge repository and generate more value
from the organization’s existing KM systems. They suggested five key features of the
advanced KM systems (social, cross-lingual, integrative, dynamic and agile, simple and
understandable) that enable support to all four types of data-driven decisions by
accommodating diverse sources of data and knowledge. Advanced KM systems go
beyond a simple text mining tool, or a document analysis mechanism, or a mere knowledge
sharing system. These systems allow for the integration of human knowledge and insight
with big data and facilitate the incorporation of big data and knowledge into strategic
decisions.
Thirathon et al. (2017) investigated how managerial decision-making is influenced by big
data, analytics and an analytic culture. The results of a cross-sectional survey of 163 senior IT
managers revealed that BDA created an incentive for managers to base more of their
decisions on analytic insights. They found that the main driver of analytic-based decision-
making is an analytic culture – the attitude towards the usefulness, use and benefits of
analytics. The analytic culture in an organization is a far stronger predictor of analytic-based
decision-making than the sophistication of BDA practices. They conclude that firms with a
highly analytic culture can use this resource for competitive advantage.
In an editorial for a journal special section, de Vasconcelos and Rocha (2017) highlighted
data science as a key challenge in the KM context that can help reduce uncertainty and
discover new patterns of organizational behaviour. They argued that effective KM and
engineering are based on the dynamic nature of organizational knowledge, and predictive
data analysis and insights identification can transform and add value to the organization.
The papers in that special section included a set of data science techniques, such as data
mining, machine learning and big data approaches for network analysis and corporate KM
practices.
Cervone (2017) analyzed curricula of selected KM programmes in the context of data
science. As data mining and related activities have historically played a large role in KM
practice, he suggested that perhaps data science and its affiliated sub-disciplines are simply
the natural evolution of KM, as it becomes more integral to professional practice in
organizations. He concluded that if this was the case, ensuring that core KM principles and
theory were integrated into data science programmes would be an important focus of future
efforts in the field.
Ghasemaghaei et al. (2018) examined empirically the impact of data analytics competency
on firms’ decision-making performance. The results revealed that all dimensions of data
analytics competency significantly improve firms’ decision quality and efficiency (except
bigness of data in case of decision efficiency).
Song et al. (2018) examined the business value of data analytics usage and how such value
differs in different market conditions. They found that both demand- and supply-side data
analytics usage had a positive effect on the performance of merchants. When merchants’
product variety was high, the influence of demand-side data analytics usage on performance
was strengthened, whereas such impact was weakened for supply-side data analytics. In
addition, when competitive intensity was high, the performance implication of demand-side
data analytics usage was strengthened, whereas such impact was not strengthened for
supply-side data analytics.
Mandal (2019) explored the impact of BDA management capabilities (planning,
investment decision-making, coordination and control) on supply chain (SC) resilience dimensions
(preparedness, alertness and agility). The findings indicated that BDA planning, coordination
and control were critical enablers of SC preparedness, alertness and agility. BDA investment
decision-making did not have a prominent influence on any of the SC resilience dimensions.
Seven papers explored data science from the KM perspective. It was stated that data
mining and related activities have played a large role in KM practice. Therefore, we could
argue that perhaps data science and its affiliated sub-disciplines are just a natural evolution
of KM (Cervone, 2017). Data science and advanced analytics are key challenges in the KM
context that help to make strategic decisions (Intezari and Gressel, 2017; Mandal, 2019),
reduce the uncertainty, discover new patterns of organisational behaviour (de Vasconcelos
and Rocha, 2017) and significantly improve decision-making performance in organizations
(Thirathon et al., 2017; Ghasemaghaei et al., 2018; Song et al., 2018). However, the existing KM
systems need to be upgraded to advanced KM systems (Intezari and Gressel, 2017).
3.1.6 Data science from the perspective of health sciences. Several publications discussed
data science from a health sciences perspective. Ohno-Machado (2013) provided a short
editorial introduction to the special issue of the Journal of the American Medical Informatics
Association (JAMIA), which focused on imaging informatics, and patient- and provider-
centred studies of health IT, representing a broad spectrum of work done by biomedical data
scientists. It was acknowledged that health-relevant big data present unique challenges, and
biomedical data scientists are in very high demand. The author emphasized the urgent need
to train more experts in biomedical data science/biomedical informatics and to invest in
biomedical informatics research to develop tools that make full use of health-related data.
Margolis et al. (2014) described the NIH’s Big Data to Knowledge (BD2K) initiative that
was capitalizing on biomedical big data. BD2K consisted of four focused areas: (1) improving
the ability to locate, access, share and use biomedical big data; (2) developing and
disseminating data analysis methods and software; (3) enhancing training in biomedical big
data and data science; and (4) establishing centres of excellence in data science. It was noted
that addressing the challenges associated with biomedical big data required all parts of the
big data ecosystem to be engaged. BD2K was deploying an integrated plan of action that
tackled numerous aspects of the big data challenge, including multiple elements of data
science, training, policy and community behaviour.
Ku et al. (2015) described the Mobilize Centre, which aimed to harness a huge amount of
data characterizing human movement (e.g. from research labs, clinics, smartphones and
wearable sensors) to advance human movement research, improve mobility and help lay
the foundation for using data science methods in biomedicine. The centre was organized
around four data science research cores: biomechanical modelling, statistical learning,
behavioural and social modelling and integrative modelling. The centre developed new
approaches, shared data and validated software tools and trained researchers. Kumar et al.
(2015) described the Centre of Excellence for mobile sensor data-to-knowledge (MD2K), which
was chosen as one of 11 Big Data Centres of Excellence by the NIH, as part of its Big Data-to-
Knowledge initiative. MD2K developed innovative tools to streamline the collection,
integration, management, visualization, analysis and interpretation of health data generated
by mobile and wearable sensors. The goal of the big data solutions being developed by MD2K
was to reliably quantify physical, biological, behavioural, social and environmental factors
that contribute to health and disease risk. The research conducted by MD2K was targeted at
improving health through early detection of adverse health events and by facilitating
prevention. MD2K made its tools, software and training materials available and organized
workshops and seminars to encourage their use by researchers and clinicians.
Amirian et al. (2017) discussed data science and its foundation in the context of healthcare.
Applications of data science were illustrated as analytical tasks in regression, classification,
clustering, similarity matching, content analysis, simulation and profiling categories. Then,
the data science process and its steps were discussed in the context of the Cross-Industry Standard
Process for Data Mining (CRISP-DM). Concepts of success criteria and model performance
were illustrated thoroughly in the context of predictive analytics, and data science tools,
environments and software were discussed. They noted that there is a wide spectrum of
opportunities for using data science methods for improving healthcare systems. Xia et al.
(2017) tested the hypothesis that the value of data for scientific investigators, in terms of the
impact of the publications based on the data, decreases over time on the basis of biomedical
data sets.
Almugbel et al. (2018) explored the use of interactive software notebooks to document and
distribute bioinformatics research. Four interactive notebooks that can be used to
disseminate a wide range of bioinformatics analyses were presented. Estiri et al. (2018)
provided a data quality assessment tool for evaluation and visualization of completeness and
conformance in EHR data repositories. Ohno-Machado’s (2018a, b) editorial material was an
introduction to an issue of JAMIA focussing on biomedical data science. The issue illustrated a
broad range of techniques and application areas in this field and highlighted tools and
applications of data science in a variety of domains, all of which use clinical text as a source of
data. Ohno-Machado (2018a, b) also wrote editorial highlights for another issue of JAMIA,
which dealt with data science and artificial intelligence (AI) that can improve clinical practice
and research. Brennan et al. (2018) provided editorial material to the JAMIA highlighting
biomedical informatics and data science as evolving fields with significant overlap. They
believed that biomedical data science offers new and powerful tools to better understand
health and disease through insights gleaned from data and has the potential to accelerate
data-driven discovery. The editorial briefly introduced the eight papers in the JAMIA special
issue that illustrated methods and motivations, data and analytics applied to make sense of
and draw biomedical and health implications from a wide range of observations about life
sciences phenomena that can be used to study health and disease. Spruit and Lytras (2018)
investigated adaptive analytic systems within the knowledge discovery process in
healthcare: domain and data understanding for physician- and patient-centric healthcare,
data pre-processing and modelling using natural language processing and (big) data analytic
techniques and model evaluation and knowledge deployment through information
infrastructures. They noted that the adaptive component in healthcare system prototypes
may translate to data-driven personalisation services, including personalised medicine. They
explored how applied data science for patient-centric healthcare can enable physicians and
patients to improve healthcare more effectively and efficiently. They proposed meta-
algorithmic modelling as a solution-oriented design science research framework in alignment
with the knowledge discovery process. Courneya and Mayo (2018) described a project
in which the library supports analysis of high-throughput data from global molecular profiling
experiments by offering a high-performance computer with open-source software along with
expert bio-informationist support. The library’s bio-informationist identified the ideal
computing hardware and a group of open-source bioinformatics software to provide analysis
options for experimental data such as scientific images, sequence reads and flow cytometry
files. The bio-informationist developed self-guided learning materials and workshops or
consultations. Researchers applied the data analysis techniques that they learnt in the
library’s ideal computing environment.
Evans and Krumholz (2019) discussed the creation of people-driven data collaboratives
based on the health-related experiences of individuals, with governance structures that
enable participants to have a meaningful voice in issues surrounding the use of their
own data.
In total, 14 papers explored data science from the perspective of health sciences. However,
three papers discussing health data science tools, techniques and applications also belonged
to the category 3.1.4 and four papers were short editorials to the JAMIA. The authors believe
that data science has the potential to revolutionize healthcare. There is a large amount of data
from various sources such as hospital records, electronic medical records, clinical trials,
genetic information, insurance data, billing, wearable data, care management databases,
clinical studies, social media, etc. With the availability of data analytics methods, it is feasible
to make sense of all the accessible data to ask important questions and improve the healthcare
systems by building services for monitoring patients, identifying high-risk populations and
predicting disease outbreaks for relevant organizations (Amirian et al., 2017, pp. 35–36).
3.1.7 Other topics. Some topics were discussed in only a few papers. For example, three
papers addressed the relationship between data science and LIS (Cervone, 2016; Wang, 2018;
Hjørland, 2019). Cervone (2016) found that contributions of LIS communities would be critical
in addressing data science issues. Wang (2018) analyzed the mission statement and nature of
both data science and information science by reviewing existing works and drawing on the
data-information-knowledge-wisdom (DIKW) hierarchy. He found that the mission, task and
nature of data science and information science are congruent. They greatly overlap, share
similar concerns, are closely interrelated and together form the components of “information
chain” research. Furthermore, they can complement each other. Information science can make
unique contributions to data science research, including conception of data, data quality
control, data librarianship and theory dualism. The document theory, as a promising
direction of unified information science, should be introduced to data science to solve the
disciplinary divide. He found that a coherent set of meta-theoretical assumptions applies to
both sciences. Hjørland (2019) considered the nature of data and “big data” and the relation
between data, information, knowledge and documents. He found that the most fruitful
theoretical frame for knowledge organization and data science is the social epistemology
suggested by Shera (1951). He also highlighted that some confusion in data science arises
from the notion of “data”, which varies considerably in different disciplines; it is not clear
whether the terms “data” and “document” are synonymous or not. He suggested that data
should be defined as information on properties of units of analysis and data can only be
managed if they are somehow recorded in documents.
Greenberg (2017) provided a framework for addressing the disconnect between metadata
and data science. She indicated that data science cannot progress without metadata research
and identified pathways for developing a more cohesive metadata research agenda in data
science. She identified factors that challenge metadata research in the digital ecosystem;
defined metadata and data science; and presented the concepts big metadata, smart metadata
and metadata capital as part of a metadata lingua franca connecting to data science.
Four papers discussed data science from the perspective of information systems (IS). For
example, the editorial material by Agarwal and Dhar (2014) concluded that the IS discipline
has been thinking and researching questions at the intersection of technology, data, business
and society for five decades and should leverage its thought leadership to become a
centrepiece of education, business and policy.
Saar-Tsechansky (2015) noted in an MIS Quarterly editor’s comment that the possible application
of data science methods to business problems presents a wealth of research opportunities
that the IS data science community is well positioned to explore. Ghosh (2016) discussed
critical aspects of the big data problem and their importance as well as relevance to IS
research. He introduced the different aspects of the big data problem to IS scholars and
synthesized a research agenda by indicating some research areas that are relatively
unexplored or underrepresented in the literature. Addressing big data issues concerning
human–computer interactions, mixing of domain knowledge by technical experts, as well as
security and privacy were some of the major challenges that would channel IS research into
directions unexplored in the past. He believed that big data and analytics hold immense
research potential for the IS community. Berente et al. (2018) believed that increasingly
abundant trace data provide an opportunity for IS researchers to generate new theory. They
drew on the largely “manual” tradition of the grounded theory methodology and the highly
“automated” process of computational theory discovery in the sciences to develop a general
approach to computationally intensive theory development from trace data. This approach
involved the iterative application of four general processes: sampling, synchronic analysis,
lexical framing and diachronic analysis. They provided examples from recent IS research.
Two papers explored data science opportunities from the legal perspective (Schweighofer,
2015; Waltl et al., 2015). Schweighofer (2015) noted that the potential of AI and law methods in
law has not been properly used. The goal of legal data science is to complement the existing
methodology of law with the new computer-based methods and to bring it into a theoretical
framework. The author focused on the man/machine delivery of the desired products of legal
knowledge representation using AI and law methods. At present, a lot of analysis is done
manually, but the lack of sufficient resources becomes more and more evident. Waltl et al.
(2015) developed a flexible reference architecture for software-supported analysis and
annotation of semantic and linguistic properties of legal texts. They provided a Web-based
software application (LEXIA) that was tailored to German legal texts and can easily be
extended with arbitrary text mining modules. It implements a lean data model for the internal
representation of legal texts, considering the characteristics of the German legislation. The
main focus in the architectural design is the support of extensibility and adaptability. They
conducted a case study based on the German tenancy law, including relevant judgements
from the German Federal Court of Justice.
In the editorial material, Sundararajan et al. (2013) suggested a broad direction for research
into social and economic networks. They proposed four kinds of investigation as most
promising: (1) how ITs create and reveal networks whose connections represent social and
economic relationships; (2) the content that flows through networks and its economic, social
and organizational implications; (3) theories and methods to understand and utilize the rich
predictive information contained in networked data; and (4) network dynamics and how IT
affects network evolution. They discussed how the interconnected nature of these areas of
enquiry could lead to a new cumulative research tradition.
Rentier (2016) reviewed the emerging concepts in scholarly publication and aimed to
answer frequently asked questions concerning free access to scientific literature as well as to
data, science and knowledge in general. The paper provided new observations concerning the
level of compliance to institutional open-access mandates and the poor relevance of journal
prestige for quality evaluation of research and researchers. The results of introducing an
open-access policy at the University of Liege were noted. Sheble (2016) outlined, through a
selective literature review, research topics that intersect with LIS and research synthesis
methods. Topics identified included open access, information retrieval, bias and research
information ethics, referencing practices, citation patterns and data science. Zoltan (2016)
presented an interpretation of the history of science as a history of data and data amounts.
Saltz et al. (2017) reported on a set of case studies where researchers were embedded within
data science teams to identify the attributes that can help describe data science projects and
challenges faced by the teams. They identified 14 characteristics that can help describe a data
science project and used these characteristics to create a model that defined two key
dimensions of the project. Finally, by clustering the projects within these two dimensions,
they identified four types of data science projects, and based on the type of project, they
identified some of the sociotechnical challenges that project teams should expect to encounter
when executing projects. Sexton et al. (2017) explored the notion of trust from the stakeholder
perspectives in relation to government administrative data sharing and re-use in England.
They demonstrated that securing public trust in data initiatives is dependent on a broader
balance of trust between a network of actors involved in data sharing and use.
Rempel et al. (2018) offered propositions for government data science public engagement
practice that were rooted in empirical lessons from public engagement literature. They
provided a narrative literature review of public engagement with new technology and
synthesized five themes that focussed on public engagement with new technology. These
themes were then used to develop five novel propositions for public engagement with
government data science. These included considering the varied and many “publics” who may
be engaged in government data science; not assuming that providing publics with
information on data science initiatives will lead to public acceptance; determining the
contingencies of trust for government data science and public engagement through
trustworthy practice; and designing public engagements that incorporate robust, critical and
ongoing deliberation of data science. Their final proposition was to ensure holistic public
participation that moves beyond privacy and consent. Foster et al. (2018) reviewed data work,
and how negotiating a trade-off between its value and risks requires locating its processes
within the contexts of its conditions and consequences. These included international, national
and sectoral conditions of law, policy and regulation at a macro level; organizational
conditions of information and data governance that aimed to address the value and risks of
data work at a meso level, along with attention to the everyday contexts of data and
information handling by data, information and other professionals at a micro level. A
conceptual framework was presented that located the processes of data work within the
matrix of its macro, meso and micro conditions; its consequences for individuals,
organizations and society; and the relations between them. Aristodemou and Tietze (2018)
presented a narrative literature review of the state of the art in intellectual property analytics
in four main categories: KM, technology management, economic value and extraction and
effective management of information.
Beaton (2016) examined Lionel Trilling’s 1948–1950 essay about data, which framed data
as part of a broader cultural history, including literature, drama, epic poetry and the arts.
There were also some book reviews, for example, of Granville’s “Developing Analytic
Talent: Becoming a Data Scientist” (2014) by Ting (2015), of Kelleher and Tierney’s “Data
science” (2018) by Wilson (2018) and of Cady’s “Data Science Handbook” (2017) by
Brunner (2018).
In total, 23 papers belonged to the category of other topics. These papers explored the
relationship between data science and LIS (Cervone, 2016; Wang, 2018; Hjørland, 2019) and
between metadata and data science (Greenberg, 2017). Several papers discussed data science
from the perspective of IS (Agarwal and Dhar, 2014; Saar-Tsechansky, 2015; Ghosh, 2016;
Berente et al., 2018) and legal science (Schweighofer, 2015; Waltl et al., 2015). There were also
papers that discussed research into social and economic networks (Sundararajan et al., 2013),
reviewed scholarly publication process (Rentier, 2016; Sheble, 2016), data work (Foster et al.,
2018), intellectual property analytics (Aristodemou and Tietze, 2018) and the notion of trust
from the stakeholder perspectives (Sexton et al., 2017). Reviewed papers offered propositions
for government data science public engagement practice (Rempel et al., 2018) and identified
the attributes that can help describe data science projects and the challenges (Saltz et al.,
2017). Data were also examined in a historical context (Zoltan, 2016) and as part of broader
cultural history (Beaton, 2016). There were also several book reviews (Ting, 2015; Wilson,
2018; Brunner, 2018).

4. Conclusions
The LIS contribution to data science in the period 1980–2019 according to the Web of Science
database was quite limited – 3.4%. The first paper was published in 2005 and the number of
articles has increased over the past few years. It appears that there has been a continuous
increase in articles from 2015. The main document types are journal articles, followed by
conference proceedings and editorial material.
The analysis revealed that data science is quite interdisciplinary by nature. The reviewed
articles were diverse in content. In addition to the identified six broad categories (data science
education and training; knowledge and skills of the data professional; the role of libraries and
librarians in the data science movement; tools, techniques and applications of data science;
data science from the KM perspective; data science from the perspective of health sciences),
the topics included big data issues, data structures, information and data visualization,
solution for social computing and social network analytics, the application of agile
methodologies and principles to business intelligence, topical sequence profiling, directory-
based incentive management services for ad hoc mobile clouds, BDS models and access to
scholarly publications. Several publications fell into more than one category because these topics
were closely related. These topics were explored from the perspective of research or practice;
for example, from the perspective of the information professional and data analysts or
information systems or health sciences research. Data science was also discussed in the
historical context and as a part of a broader cultural history.
The category of tools, techniques and applications of data science was the one most frequently
addressed by the authors, which indicates that there are numerous data science tools, techniques
and applications that help obtain value from data, study data and its patterns and generate
outcomes from it, and that are applicable in a wide range of fields. This category was followed by
data science from the perspective of health sciences, data science education and training and
knowledge and skills of the data professional. Based on the analyzed publications, several
fields such as LIS, information systems, KM and health sciences provide valuable
contributions to data science.

References
Agarwal, R. and Dhar, V. (2014), “Big data, data science, and analytics: the opportunity and challenge
for IS research”, Information Systems Research, Vol. 25 No. 3, pp. 443-448.
Alluqmani, A. and Shamir, L. (2018), “Writing styles in different scientific disciplines: a data science
approach”, Scientometrics, Vol. 115 No. 2, pp. 1071-1085.
Almugbel, R., Hung, L.H., Hu, J., Almutairy, A., Ortogero, N., Tamta, Y. and Yeung, K.Y. (2018),
“Reproducible Bioconductor workflows using browser-based interactive notebooks and
containers”, Journal of the American Medical Informatics Association, Vol. 25 No. 1, pp. 4-12.
Amirian, P., van Loggerenberg, F. and Lang, T. (2017), “Data science and analytics”, in Amirian, P.,
Lang, T. and van Loggerenberg, F. (Eds), Big Data in Healthcare, SpringerBriefs in
Pharmaceutical Science and Drug Development, Springer, Cham, pp. 15-37.
Antell, K., Foote, J.B., Turner, J. and Shults, B. (2014), “Dealing with data: science librarians’
participation in data management at association of research libraries institutions”, College and
Research Libraries, Vol. 75 No. 4, pp. 557-574.
Aristodemou, L. and Tietze, F. (2018), “The state-of-the-art on intellectual property analytics (IPA): a
literature review on artificial intelligence, machine learning and deep learning methods for
analysing intellectual property (IP) data”, World Patent Information, Vol. 55, pp. 37-51.
Barbuti, N., Caldarola, T. and Ferilli, S. (2018), “A graphic matching process for searching and
retrieving information in digital libraries of manuscripts”, in Serra, G. and Tasso, C. (Eds),
Digital Libraries and Multimedia Archives. IRCDL 2018. Communications in Computer and
Information Science, Springer, Cham, Vol. 806, pp. 139-150.
Baskarada, S. and Koronios, A. (2017), “Unicorn data scientist: the rarest of breeds”, Program, Vol. 51
No. 1, pp. 65-74.
Beaton, B. (2016), “How to respond to data science: early data criticism by Lionel Trilling”,
Information and Culture, Vol. 51 No. 3, pp. 352-372.
Berente, N., Seidel, S. and Safadi, H. (2018), “Research commentary - data-driven computationally
intensive theory development”, Information Systems Research, Vol. 30 No. 1, pp. 50-64.
Biswas, R. (2016), “Introducing data structures for big data”, in Effective Big Data Management and
Opportunities for Implementation, IGI Global, pp. 25-52.
Borgman, C.L., Darch, P.T., Sands, A.E., Pasquetto, I.V., Golshan, M.S., Wallis, J.C. and Traweek, S.
(2015), “Knowledge infrastructures in science: data, diversity, and digital libraries”,
International Journal on Digital Libraries, Vol. 16 Nos 3-4, pp. 207-227.
Brennan, P.F., Chiang, M.F. and Ohno-Machado, L. (2018), “Biomedical informatics and data science:
evolving fields with significant overlap”, Journal of the American Medical Informatics
Association, Vol. 25 No. 1, pp. 2-3.
Brunner, R.J. (2018), “The data science handbook. Field Cady, John Wiley & Sons, Inc., Hoboken, NJ,
2017. 416 pp”, Journal of the Association for Information Science and Technology, Vol. 69 No. 6,
pp. 861-863.
Cady, F. (2017), The Data Science Handbook, John Wiley & Sons, Hoboken, NJ.
Carter, D. and Sholler, D. (2016), “Data science on the ground: hype, criticism, and everyday work”,
Journal of the Association for Information Science and Technology, Vol. 67 No. 10, pp. 2309-2319.
Cervone, H.F. (2016), “Informatics and data science: an overview for the information professional”,
Digital Library Perspectives, Vol. 32 No. 1, pp. 7-10.
Cervone, H.F. (2017), “What does the evolution of curriculum in knowledge management programs tell
us about the future of the field?”, VINE Journal of Information and Knowledge Management
Systems, Vol. 47 No. 4, pp. 454-466.
Chen, H.L. and Zhang, Y. (2017), “Educating data management professionals: a content analysis of job
descriptions”, The Journal of Academic Librarianship, Vol. 43 No. 1, pp. 18-24.
Cho, J. (2019), “Subject analysis of LIS data archived in a Figshare using co-occurrence analysis”,
Online Information Review, Vol. 43 No. 2, pp. 256-264.
Costa, C. and Santos, M.Y. (2017), “The data scientist profile and its representativeness in the
European e-Competence framework and the skills framework for the information age”,
International Journal of Information Management, Vol. 37 No. 6, pp. 726-734.
Courneya, J.P. and Mayo, A. (2018), “High-performance computing service for bioinformatics and data
science”, Journal of the Medical Library Association, Vol. 106 No. 4, p. 494.
Da Sylva, L. (2017), “The theoretical and practical impact of data on information professionals”,
Documentation et Bibliotheques, Vol. 63 No. 4, pp. 5-34.
de Vasconcelos, J.B. and Rocha, A. (2017), “Special section on data science and business intelligence”,
International Journal of Information Management, Vol. 37 No. 6, pp. 716-717.
Erdmann, C. (2015), “Data scientist training for librarians”, Library and Information Services in
Astronomy VII: Open Science at the Frontiers of Librarianship ASP Conference Series, Vol. 492,
pp. 31-37, available at: www.aspbooks.org/a/volumes/article_details/?paper_id536774
(accessed 12 July 2020).
Estiri, H., Stephens, K.A., Klann, J.G. and Murphy, S.N. (2018), “Exploring completeness in clinical data
research networks with DQe-c”, Journal of the American Medical Informatics Association,
Vol. 25 No. 1, pp. 17-24.
Evans, B.J. and Krumholz, H.M. (2019), “People-powered data collaboratives: fueling data science with
the health-related experiences of individuals”, Journal of the American Medical Informatics
Association, Vol. 26 No. 2, pp. 159-161.
Foster, J., McLeod, J., Nolin, J. and Greifeneder, E. (2018), “Data work in context: value, risks, and
governance”, Journal of the Association for Information Science and Technology, Vol. 69 No. 12,
pp. 1414-1427.
Ghasemaghaei, M., Ebrahimi, S. and Hassanein, K. (2018), “Data analytics competency for improving
firm decision making performance”, The Journal of Strategic Information Systems, Vol. 27 No. 1,
pp. 101-113.
Ghosh, J. (2016), “Big data analytics: a field of opportunities for information systems and technology
researchers”, Journal of Global Information Technology Management, Vol. 19 No. 4, pp. 217-222.
Gollub, T., Lipka, N., Koh, E., Genc, E. and Stein, B. (2016), “Topical sequence profiling”, 2016 27th
International Workshop on Database and Expert Systems Applications (DEXA), IEEE,
pp. 207-211.
Granville, V. (2014), Developing Analytic Talent: Becoming a Data Scientist, John Wiley & Sons,
Hoboken, NJ.
Greenberg, J. (2017), “Big metadata, smart metadata, and metadata capital: toward greater synergy
between data science and metadata”, Journal of Data and Information Science, Vol. 2 No. 3,
pp. 19-36.
Guo, J., Zhang, W., Fan, W. and Li, W. (2018), “Combining geographical and social influences with
deep learning for personalized point-of-interest recommendation”, Journal of Management
Information Systems, Vol. 35 No. 4, pp. 1121-1153.
Halim, Z. and Khan, S. (2019), “A data science-based framework to categorize academic journals”,
Scientometrics, Vol. 119 No. 1, pp. 393-423.
Hjørland, B. (2019), “Data (with big data and database semantics)”, KO Knowledge Organization,
Vol. 45 No. 8, pp. 685-708.
Intezari, A. and Gressel, S. (2017), “Information and reformation in KM systems: big data and strategic
decision-making”, Journal of Knowledge Management, Vol. 21 No. 1, pp. 71-91.
Kelleher, J.D. and Tierney, B. (2018), Data Science, MIT Press, Cambridge, MA.
Kennan, M.A. (2017), “‘In the eye of the beholder’: knowledge and skills requirements for data
professionals”, Information Research, Vol. 22 No. 4, available at: www.informationr.net/ir/22-4/
rails/rails1601.html (accessed 18 January 2019).
Kocheturov, A. and Pardalos, P.M. (2016), “Data science for massive networks”, in Braslavski, P.,
Markov, I., Pardalos, P., Volkovich, Y., Ignatov, D.I., Koltsov, S. and Koltsova, O. (Eds),
Information Retrieval. RuSSIR 2015. Communications in Computer and Information Science,
Springer, Cham, Vol. 573, pp. 88-100.
Koltay, T. (2019), “Accepted and emerging roles of academic libraries in supporting Research 2.0”, The
Journal of Academic Librarianship, Vol. 45 No. 2, pp. 75-80.
Ku, J.P., Hicks, J.L., Hastie, T., Leskovec, J., Re, C. and Delp, S.L. (2015), “The Mobilize Center: an NIH
big data to knowledge center to advance human movement research and improve mobility”,
Journal of the American Medical Informatics Association, Vol. 22 No. 6, pp. 1120-1125.
Kumar, S., Abowd, G.D., Abraham, W.T., al’Absi, M., Gayle Beck, J., Chau, D.H., Condie, T., Conroy,
D.E., Ertin, E., Estrin, D. and Ganesan, D. (2015), “Center of excellence for mobile sensor data-to-
knowledge (MD2K)”, Journal of the American Medical Informatics Association, Vol. 22 No. 6,
pp. 1137-1142.
Larson, D. and Chang, V. (2016), “A review and future direction of agile, business intelligence,
analytics and data science”, International Journal of Information Management, Vol. 36 No. 5,
pp. 700-710.
Ledley, T.S., Dahlman, L., Domenico, B. and Taber, M.R. (2005), “Facilitating the effective use of earth
science data in education through digital libraries: bridging the gap between scientists and
educators”, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries
(JCDL’05), IEEE, p. 386.
Leung, C.K., Braun, P., Enkhee, M., Pazdor, A.G., Sarumi, O.A. and Tran, K. (2016), “Knowledge
discovery from big social key-value data”, 2016 IEEE International Conference on Computer
and Information Technology (CIT), IEEE, pp. 484-491.
Lorentzen, D.G. and Nolin, J. (2017), “Approaching completeness: capturing a hashtagged Twitter
conversation and its follow-on conversation”, Social Science Computer Review, Vol. 35 No. 2,
pp. 277-286.
Mandal, S. (2019), “The influence of big data analytics management capabilities on supply chain
preparedness, alertness and agility: an empirical investigation”, Information Technology and
People, Vol. 32 No. 2, pp. 297-318.
Margolis, R., Derr, L., Dunn, M., Huerta, M., Larkin, J., Sheehan, J. and Green, E.D. (2014), “The national
institutes of health’s big data to knowledge (BD2K) initiative: capitalizing on biomedical big
data”, Journal of the American Medical Informatics Association, Vol. 21 No. 6, pp. 957-958.
Maxwell, D., Norton, H. and Wu, J. (2018), “The data science opportunity: crafting a holistic strategy”,
Journal of Library Administration, Vol. 58 No. 2, pp. 111-127.
Mitchell, E.T. (2015), “Reproducibility and its application to technical service processes”, Technical
Services Quarterly, Vol. 32 No. 4, pp. 402-413.
Newman, R., Chang, V., Walters, R.J. and Wills, G.B. (2016), “Model and experimental development for
business data science”, International Journal of Information Management, Vol. 36 No. 4,
pp. 607-617.
Ohno-Machado, L. (Ed.) (2013), “Data science and informatics: when it comes to biomedical data, is
there a real distinction?”, Journal of the American Medical Informatics Association, Vol. 20 No. 6,
p. 1009.
Ohno-Machado, L. (2018a), “Special focus on biomedical data science”, Journal of the American
Medical Informatics Association, Vol. 25 No. 1, p. 1.
Ohno-Machado, L. (2018b), “Data science and artificial intelligence to improve clinical practice and
research”, Journal of the American Medical Informatics Association, Vol. 25 No. 10, p. 1273.
Ortiz-Repiso, V., Greenberg, J. and Calzada-Prado, J. (2018), “A cross-institutional analysis of data-
related curricula in information science programmes: a focused look at the iSchools”, Journal of
Information Science, Vol. 44 No. 6, pp. 768-784.
Park, H.W. and Leydesdorff, L. (2013), “Decomposing social and semantic networks in emerging “big
data” research”, Journal of Informetrics, Vol. 7 No. 3, pp. 756-765.
Poulova, P., Mikulecka, J., Kozel, T. and Klimova, B. (2018), “Data science study program”, 12th
International Scientific Conference on Distance Learning in Applied Informatics (DIVAI),
pp. 337-347.
Qasim, M.A., Ul Hassan, S., Aljohani, N.R. and Lytras, M.D. (2017), “Human behavior analysis in the
production and consumption of scientific knowledge across regions: a case study on
publications in Scopus”, Library Hi Tech, Vol. 35 No. 4, pp. 577-587.
Rempel, E.S., Barnett, J. and Durrant, H. (2018), “Public engagement with UK government data
science: propositions from a literature review of public engagement on new technologies”,
Government Information Quarterly, Vol. 35 No. 4, pp. 569-578.
Rentier, B. (2016), “Open science: a revolution in sight?”, Interlending and Document Supply, Vol. 44
No. 4, pp. 155-160.
Saar-Tsechansky, M. (2015), “Editor’s comments: the business of business data science in IS journals”,
MIS Quarterly, Vol. 39 No. 4, pp. iii-vi.
Saltz, J., Shamshurin, I. and Connors, C. (2017), “Predicting data science sociotechnical execution
challenges by categorizing data science projects”, Journal of the Association for Information
Science and Technology, Vol. 68 No. 12, pp. 2720-2728.
Schweighofer, E. (2015), “The role of AI & law in legal data science”, in Rotolo, A. (Ed.), Legal
Knowledge and Information Systems, JURIX 2015: The Twenty-Eight Annual Conference, IOS
Press, Amsterdam, pp. 191-192.
Sexton, A., Shepherd, E., Duke-Williams, O. and Eveleigh, A. (2017), “A balance of trust in the use of
government administrative data”, Archival Science, Vol. 17 No. 4, pp. 305-330.
Sheble, L. (2016), “Research synthesis methods and library and information science: shared problems,
limited diffusion”, Journal of the Association for Information Science and Technology, Vol. 67
No. 8, pp. 1990-2008.
Shera, J.H. (1951), “Documentation: its scope and limitations”, The Library Quarterly, Vol. 21 No. 1,
pp. 13-26.
Si, L., Zhuang, X., Xing, W. and Guo, W. (2013), “The cultivation of scientific data specialists:
development of LIS education oriented to e-science service requirements”, Library Hi Tech,
Vol. 31 No. 4, pp. 700-724.
Song, I.Y. and Zhu, Y. (2017), “Big data and data science: opportunities and challenges of iSchools”,
Journal of Data and Information Science, Vol. 2 No. 3, pp. 1-18.
Song, P., Zheng, C., Zhang, C. and Yu, X. (2018), “Data analytics and firm performance: an empirical
study in an online B2C platform”, Information and Management, Vol. 55 No. 5, pp. 633-642.
Spruit, M. and Lytras, M. (2018), “Applied data science in patient-centric healthcare: adaptive analytic
systems for empowering physicians and patients”, Telematics and Informatics, Vol. 35 No. 4,
pp. 643-653.
Stanton, J.M., Palmer, C.L., Blake, C. and Allard, S. (2012), “Interdisciplinary data science education”,
Special Issues in Data Management (ACS Symposium Series, Vol. 1110), Washington, DC,
American Chemical Society.
Sundararajan, A., Provost, F., Oestreicher-Singer, G. and Aral, S. (2013), “Information in digital,
economic, and social networks”, Information Systems Research, Vol. 24 No. 4, pp. 883-905.
Tang, R. and Sae-Lim, W. (2016), “Data science programs in US higher education: an exploratory
content analysis of program description, curriculum structure, and course focus”, Education for
Information, Vol. 32 No. 3, pp. 269-290.
Thirathon, U., Wieder, B., Matolcsy, Z. and Ossimitz, M.L. (2017), “Big data, analytic culture and
analytic-based decision making evidence from Australia”, Procedia Computer Science, Vol. 121,
pp. 775-783.
Ting, I. (2015), “Developing analytic talent: becoming a data scientist”, Online Information Review,
Vol. 39 No. 2, p. 273.
Umachandran, K. and Ferdinand-James, D.S. (2017), “Affordances of data science in agriculture,
manufacturing, and education”, in Tamane, S. (Ed.), Privacy and Security Policies in Big Data,
IGI Global, pp. 14-40.
Virkus, S. and Garoufallou, E. (2019a), “Data science from a library and information science
perspective”, Data Technologies and Applications, Vol. 53 No. 4, pp. 422-441, doi: 10.1108/DTA-
05-2019-0076.
Virkus, S. and Garoufallou, E. (2019b), “Data science from a perspective of computer science”, in
Garoufallou, E., Fallucchi, F. and William De Luca, E. (Eds), Metadata and Semantic Research.
MTSR 2019. Communications in Computer and Information Science, Springer, Cham, Vol. 1057,
pp. 209-219, doi: 10.1007/978-3-030-36599-8_19.
Waltl, B., Zec, M. and Matthes, F. (2015), “A data science environment for legal texts”, in Rotolo, A.
(Ed.), Legal Knowledge and Information Systems. JURIX 2015: The Twenty-Eight Annual
Conference, IOS Press, Amsterdam, pp. 193-194.
Wang, K. (2018), “Twinning data science with information science in schools of library and
information science”, Journal of Documentation, Vol. 74 No. 6, pp. 1243-1257.
Wilson, T.D. (2018), “Review of: Kelleher, John D. and Tierney, Brendan. Data science. Cambridge,
MA: MIT Press, 2018”, Information Research, Vol. 23 No. 2, available at: informationr.net/ir/
reviews/revs630.html (accessed 12 July 2020).
Xia, W., Wan, Z., Yin, Z., Gaupp, J., Liu, Y., Clayton, E.W. and Malin, B.A. (2017), “It’s all in the timing:
calibrating temporal penalties for biomedical data sharing”, Journal of the American Medical
Informatics Association, Vol. 25 No. 1, pp. 25-31.
Yousafzai, A., Chang, V., Gani, A. and Noor, R.M. (2016), “Directory-based incentive management
services for ad-hoc mobile clouds”, International Journal of Information Management, Vol. 36
No. 6, pp. 900-906.
Zhou, X., Li, W. and Arundel, S.T. (2018), “A spatio-contextual probabilistic model for extracting
linear features in hilly terrains from high-resolution DEM data”, International Journal of
Geographical Information Science, Vol. 33 No. 4, pp. 666-686.
Zoltan, G. (2016), “Big data, science, causality”, Informacios Tarsadalom, Vol. 16 No. 2, p. 32.

About the authors


Dr Sirje Virkus is a Professor of Information Science at the School of Digital Technologies of Tallinn
University, Estonia. She holds a PhD in Information and Communication Studies from the Manchester
Metropolitan University. She is the Head of the Study Area of Information Sciences at Tallinn
University. She has extensive experience working with educational innovation and research in the
higher education sector in Estonia. She has actively participated in many national and international
projects as a coordinator and a partner. Her research interests are focused on the development of
information-related competencies (data, information, media and digital competencies), information and
communications technology innovation in education, library and information science education and
internationalization. She has written more than 170 research publications, edited several books and has
been an invited speaker at international conferences. She belongs to the editorial board of several high-
quality scientific journals (“Information Research,” “Global Knowledge, Memory and Communication,”
“Nordic Journal of Information Literacy in Higher Education”) and conference program committees.
Sirje Virkus is the corresponding author and can be contacted at: sirje.virkus@tlu.ee
Dr Emmanouel Garoufallou is an Assistant Professor of the Department of Library Science, Archives
and Information Systems, School of Social Sciences, International Hellenic University, Thessaloniki,
Greece. He holds a Master’s Degree in Library and Information Management from Northumbria
University and a PhD in Digital Libraries from Manchester Metropolitan University in the UK. He has
been a Project Manager and Research Associate of various EC projects. Since 2013, he has served as the
General Chair and co-Chair of the Metadata and Semantics Research (MTSR) International Conference.
He has published extensively; he has edited several books in English (Springer Conference Proceedings).
He serves or has served as an editorial board member of various international conferences and journals
such as the “Education for Information” and “International Information and Library Review.” Currently,
he serves as the Topic Area Editor – “Open and social data: data sharing” of the journal Data
Technologies and Applications. He serves as the Executive Editor of the International Journal of
Metadata, Semantics and Ontologies.

