contributed articles

106 COMMUNICATIONS OF THE ACM | JULY 2023 | VOL. 66 | NO. 7

DOI:10.1145/3582491

While the success of early data-science applications is evident, the full impact of data science has yet to be realized.

BY M. TAMER ÖZSU

Data Science—A Systematic Treatment

key insights
• Establishing data science as a field of study or an academic discipline requires identification of its foundations and scope.
• Data science is not a proper subset of AI or of statistics. It is broader than data analytics using statistical or machine-learning techniques.
• Data science is highly interdisciplinary and involves STEM and non-STEM disciplines.
• Applications are crucial for data science, without which there is only big-data management and processing.

There is a data-driven revolution under way in science and society, disrupting every form of enterprise. We are collecting and storing data more rapidly than ever before. The value of data as a central asset in an organization is now well established and generally accepted. The Economist called data “the world’s most valuable resource.”40 The World Economic Forum’s briefing paper, A New Paradigm for Business of Data, states “At the heart of digital economy and society is the explosion of insight, intelligence and information—data.”5

The field of data science is expected to enable data to be leveraged for making better decisions and achieving more meaningful outcomes. Although the term data science has some history, in its current incarnation as a modern field of study, it has already had significant economic impact. A 2015 Organisation for Economic Co-operation and Development (OECD) report identified “data-driven innovation” (DDI) as having a central driving role in 21st century economies, defining DDI as “the use of data and analytics to improve and foster new products, processes, organisational methods and markets.” Data science deployments are still what might be called first generation, but their impact is already being felt in many areas: global sustainability,11 power and energy systems,25 biological and biomedical systems,38 health sciences and health informatics,12 finance and insurance,8 smart cities,33 digital humanities,28 and more.

The last decade has established the terms “big data,” “data analytics,” and “data science” into our lexicon, both as buzzwords and as important fields of study. Interest in the topic, as evidenced by Google Trends (see Figure 1), has exploded over the same period. An increasing number of countries have released policy statements related to data science. In academia, data-science programs and research institutes have been established with significant speed, while many industrial organizations have created data-science units. A quick survey of these programs and initiatives suggests a common core but also a lack of unified and clear framing of data science.

There are several reasons why clarification is helpful. One is to be


able to understand whether data science is an academic discipline. This is hard to know without a definition of data science and an identification of its core and scope. A related reason is to provide an intellectually consistent framing to the numerous data-science institutes and academic units being formed. A third reason is to bring some clarity to the question of who a data scientist is. The point is not to constrain what is meant by a data scientist or to limit the scope of current academic initiatives but to acknowledge the diversity around some commonalities. The final reason is that a systematic investigation of the field is likely to identify important techniques and tools that should be in a professional data scientist’s toolbox.

Part of the difficulty is the carelessly interchangeable use of the terms “big data,” “data analytics,” and “data science” in much of the popular literature, which frequently spills over to technical literature. It is important to get them right. Data analytics, as defined in the next two sections, is a component of data science and not synonymous with it. Data science is not the same as big data. Perhaps the best analogy between them is that big data is like raw material; it has considerable promise and potential if one knows what to do with it. Data science gives it purpose, specifying how to process it to extract its full potential and to what end. It does this typically in an application-driven manner, allowing applications to frame the study objective and question. Applications are central to data science; if there are no applications to drive the inquiry, it is hard to argue that there is a data-science deployment. Jagadish also emphasizes this point, stating “‘Big Data’ begins with the data characteristics (and works up from there), whereas ‘Data Science’ begins with data use (and works down from there).”22

A second difficulty is the vagueness in many definitions about the relationship between data science, machine learning (ML), and data mining (DM)—this arises from the colloquial use of “data science” to mean data analytics using ML/DM. Data science is not a subfield of ML/DM nor is it synonymous with these disciplines. More broadly, data science is not a subtopic of AI—a common claim originating from confusion on boundaries. AI and data science are conceptually different fields that overlap when ML/DM techniques are used in data analytics but otherwise have their own broader concerns. The broader scope of data science is discussed in this article, highlighting its constituents that are not part of AI. Conversely, there are topics in AI, such as agents, robotics, automated programming, and others, that are not within the scope of data science. Thus, AI and data science are related, but one does not encompass the other.

A final difficulty is that data science is broad in scope, involving numerous disciplines, and finding the right synthesis is not straightforward. At the risk of oversimplification, the following are the different constituencies that have an interest in data science: STEM people who focus on foundational techniques and underlying principles (computer scientists, mathematicians, statisticians); STEM people who focus on science and engineering data-science applications (for example, biologists, ecologists, earth and environmental scientists, health scientists, engineers); and non-STEM people who focus on social, political, and societal aspects. It is important to include all these constituencies in discussions surrounding data science while establishing a recognizable core of the field. This is a difficult balance to maintain.

The objective of this article is to put forth an internally consistent and coherent view of data science. The article discusses core components, contributing disciplines, life cycle considerations, and stakeholder communities. The main takeaways can be summarized as follows:
• It is important to clearly establish a consistent and inclusive view of the entire field, and one is proposed.
• To avoid becoming a catch-all or whatever particular circumstances allow, it is essential to define the core of the field while being inclusive, and four core areas are identified.
• It is critical to take a holistic view of activities that comprise data science.
• A framework must be established to facilitate cooperation and collaboration among a number of disciplines.

Data science is still in its early days as an emerging field. This article contributes to discussions around its nature and scope. There will, hopefully, be joinders to the discussion to better define the field.

Figure 1. Trending of data science-relevant terms. Series: Big Data, Data Analytics, Data Science, Total; horizontal axis: January 2004 to January 2023; vertical axis: 0 to 300. Source: Google Trends, March 2022.

What Is Data Science?
The origins of the term data science are fuzzy. Data is central to both statistics and computing, so both communities have tried to define the field. Statisticians suggest that its origins lead to John Tukey,41 who passionately argued in the 1960s for the separation of “data analysis” from “classical statistics.” His main point was that data analysis is an empirical science while


classical statistics is pure mathematics. Tukey defines data analysis as “procedures for analyzing data, techniques for interpreting the results of such procedures, ways to plan the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.” Capturing a precise definition of data has been important from the start of computing as a discipline. The International Federation of Information Processing’s (IFIP) definition of data is “a representation of facts or ideas in a formalized manner capable of being communicated or manipulated by some process.”20 Naur builds on this definition: “Data science is the science of dealing with data, once they have been established, while the relation of data to what they represent is delegated to other fields and sciences.”30

Clearly, both statisticians and computer scientists have been thinking about data science for a long time, and the understanding of what it is has evolved over time. A subtle shift happened in the 2000s with the recognition that data science is broader than data analytics and that it involves a process from data ingestion to the production of insights and actionable recommendations. During this period, data-intensive approaches to problem solving started to produce results. This became known as the ‘big-data revolution’ and resulted in data-centric approaches in many fields. What initially began as a computational paradigm (also called the third paradigm), where computational methods could replace or enhance laboratory experimental methods (a 2001 New York Times article declared “all science is computer science,”23) rather quickly changed to data-intensive methods. This is frequently referred to as the fourth paradigm,19 and data science systematizes this understanding.

There are significant differences between what was called data analysis (or analytics) and what the current understanding of data science entails. More modern definitions of data science encompass this broader interpretation—for example, “Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large data sets.”26 The National Consortium for Data Science (NCDS), a U.S. consortium of leaders from academia, industry, and government, defines data science as the “systematic study of organization and use of digital data for research discoveries, decision-making, and data-driven economy.”2 These are not sufficiently precise to define a field, but they capture some important common themes: interdisciplinarity (as defined in Alvargonzález4 and Choi and Pak9), a data-based approach to problem solving, the use of large and multi-modal data, the focus on deriving insights and value by discovering patterns and relationships in data, and the underlying process-oriented life cycle.

Figure 2. Data-science building blocks.
Applications (top layer)
Data Science Building Blocks (core layer):
• Data Engineering: big data management; data preparation
• Data Analytics: explore data (data mining); build models and algorithms (machine learning); data preparation
• Data Protection: security for data science; data privacy
• Data Ethics: impact on individuals, organizations, and society; ethical and normative concerns; bias in data; algorithmic bias; regulatory issues
Social and Policy Context (bottom layer)

A working comprehensive definition that captures the essence of the field and explicitly recognizes that it involves a process would be: Data science is a data-based approach to problem solving by analyzing and exploring large volumes of possibly multi-modal data, extracting knowledge and insight from it, and using information for better decision-making. It involves the process of collecting, preparing, managing, analyzing, explaining, and disseminating the data and analysis results. This is consistent with the current understanding of the broad scope of the field—see, for example, the CRA’s use of the term.17

Note that the definition intentionally uses the term “data-based” rather than the more common “data-driven.” The latter has frequently been interpreted as “data should be the main (only?) basis for decisions” since “data speaks for itself.” This is wrong—data certainly holds facts and can reveal a story, but it only speaks through those who interpret it, and that can potentially introduce biases. Therefore, data should be one of the inputs to decision making, not the only one. Furthermore, “data-driven” has come to mean that it is possible to take data and analyze it by automated tools to generate automated actions. This is also problematic despite the recent popularity of, and over-reliance on, predictive and prescriptive analytics. Though data science has significant potential, and successful data-science applications are plentiful, there are sufficient misuses of


data to give us pause and concern: Google’s algorithmic detection of influenza spread using social media data is one prominent example, while the Risk Needs Assessment Test used in the U.S. justice system is another. Therefore, “data-based” is the preferable phrase that signals that data-science deployments are aids to the decision maker, not decision makers themselves.a

a Other suitable terms that have been used are “data-enhanced” or “data-enabled.”

Data-Science Ecosystem
Data science is inherently interdisciplinary: It builds on a core set of capabilities in data engineering, data analytics, data protection, and ethics—the four pillars of data science (see Figure 2). Some of this core is technical, some is not. Although the term “data science” is frequently used to refer only to data analysis, the scope is wider, and the contributing elements of the field should be properly recognized. The core is in close interaction with application domains that have the dual function of informing which technologies, tools, algorithms, and methodologies should be developed and of leveraging these capabilities to solve their problems. Data-science application deployments are highly sensitive to existing social and policy context, and these influence both the core technologies and the application deployments.

Data engineering. Data is at the core of data science; the type of data that is used is commonly referred to as big data. There is no universal definition of big data; it is usually characterized as data that is large (volume); multi-modal (variety) with many types of data: structured, text, images, video, and others; sometimes streaming at high speed (velocity); and has quality issues (veracity). These are known as the “four Vs” and addressing them appropriately is the domain of big-data management.31 Data engineering in data science addresses two main concerns: the management of big data (including the computing platforms for its processing) and the preparation of data for analysis.

Managing big data is challenging but critical in data-science application development and deployment. These data characteristics are quite different from those that traditional data-management systems are designed for, requiring new systems, methodologies, and approaches. What is needed is a data-management platform that provides appropriate functionality and interfaces for conducting data analysis, executing declarative queries, and enabling sophisticated search. This exceeds the current state of the art, where individual systems are specialized toward a specific data model, and requires metamodels that allow for different levels of abstraction and for more fluent integration and seamless interoperability. Data preparation is typically understood to be the process of dataset selection, data acquisition, data integration, and data-quality enforcement. Applying appropriate analysis to the integrated data will provide new insights that can improve organizational effectiveness and efficiency and result in evidence-informed policies. However, for this analysis to yield meaningful results, the input data must be appropriately prepared and trustworthy. The quality of the analysis model makes little difference; if the input data is not clean and trustworthy, then the results will not be very valuable. The adage of “garbage in, garbage out” is real in data science. Data quality is an essential element in making the data trustworthy for analysis. It addresses the veracity characteristic of big data. Data quality is considered mission critical in data-science success and constitutes a major portion of the data-preparation effort for most organizations.

An important vehicle of data quality and data trustworthiness is metadata and metadata management. One particularly important type of metadata that deserves mention is provenance, which tracks the source of the original data. Another challenge is developing and instituting the appropriate system and tool support for managing provenance, and tracking data as it goes through the processing pipeline.

A very important aspect of data quality is data cleaning.21 When data from multiple sources is used, there are bound to be inconsistencies, errors in data, and missing information that must be corrected (cleaned). Techniques and methodologies for data cleaning are an important part of data engineering.

Data analytics. Data analytics is the application of statistical and ML techniques to draw insights from data under study and to make predictions about the behavior of the system under study. A first-level distinction in data analysis is made between inference and prediction. Inference is based on building a model that describes a system’s behavior by representing the input variables and relationships among them. Prediction goes further and identifies the courses of action that might yield the “best” outcomes. This classification can be made more finely grained by identifying four different classes: descriptive, which retrospectively looks at the historical data to answer the questions “What happened?” or “What does the data tell us?”; diagnostic, which is also retrospective but goes beyond descriptive to answer the question “Why has that happened?”; predictive, a forward-looking analysis of historical data that provides calculated predictions of what is likely to happen; and prescriptive, which goes further by recommending courses of action. Predictive and prescriptive analytics together are usually called advanced analytics. The relationship among these is usually evaluated along two dimensions: complexity and value.27 Going from descriptive to prescriptive, analysis becomes far more complex, but the value derived from it also substantially increases.

There are six data-analysis tasks (methods) commonly used in data science:15,24 clustering, which finds meaningful groups or collections of data based on the “similarity” of data points (data points in the same cluster are more similar to each other than they are to data points in other clusters); outlier detection, which refers to identification of rare data items in a dataset that differ significantly from the majority of the data; association rule learning, which discovers interesting relationships between variables in a large dataset; classification, which finds a function (model) that places a given data item in one of a set of predefined classes; regression, which finds the function that relates one or more independent variables to a dependent variable; and summarization, which creates a more compact representation of the dataset.

As discussed earlier, an important data source in data science is streaming data. In this case, real-time analytics must be considered as data flows continuously. Real-time analytics is particularly difficult given that most analysis algorithms are computationally heavy and usually require multiple passes over the dataset, which is challenging in streaming data.

In a data-science project, an important consideration is the selection of appropriate techniques for the task and how they can be leveraged. Given the societal impact of data-science applications and deployments, the explainability of the analysis results is equally important.

Data protection. Data science’s reliance on large volumes of varied data from many sources raises important data-protection concerns. The scale, diversity, and interconnectedness of data (for example, in online social networks) require revisiting data-protection techniques that have been mostly developed for corporate data.7,29

It is customary to discuss the relevant issues under the complementary topics of data security and data privacy. The former protects information from unauthorized access or malicious attacks, while the latter focuses on the rights of users and groups over data about themselves. Data security typically deals with data confidentiality, access control, infrastructure security, and system monitoring, and uses technologies such as encryption, trusted execution environments, and monitoring tools. Data privacy, on the other hand, deals with privacy policies and regulations, data retention and deletion policies, data-subject access request (DSAR) policies, management of data use by third parties, and user consent. Data privacy normally involves privacy-enhancing technologies (PETs). Although research on these topics is usually isolated, it is helpful to take a holistic and broader view, hence the term data protection is more appropriate and informative.

The characteristics of data used in data science pose unique challenges. Data volumes make the enforcement of access-control mechanisms more difficult and the detection of malicious data and use more challenging. The numerosity and variety of data sources make it possible to inject mis/disinformation, skewing the analysis results.7 Data-science platforms are, by necessity, scale-out systems that increase the possibility of infrastructure attacks. These environments also increase the potential for surveillance. The variability and potentially high numbers of end users and, in many data-science deployments, the need for openness for sharing analysis results and for bolstering the analysis open the possibility of data breaches and misuse. These factors seriously increase the threats and the attack surface. Therefore, protection is required for the entirety of the data-science life cycle, from data acquisition to the dissemination of results, as well as for secure archiving or deletion. An implicit goal of data science is to gain access to as much data as possible, which directly conflicts with the fundamental least-privilege security principle of providing access to as few resources as necessary. Closing this gap includes careful redesign and advancement of security technologies to preserve the integrity of scientific results and data privacy and to comply with regulations and agreements governing data access. Techniques that have been developed for privacy-preserving data mining are examples of this consideration.

Data-science ethics. The fourth building block of data science is ethics. In many discussions, ethics is bundled with a discussion of data privacy. The two topics certainly have a strong relationship, but they should be considered separate pillars of the data-science core. The literature typically refers to data ethics as “. . . the branch of ethics that studies and evaluates moral problems related to data, . . . algorithms, . . . and corresponding practices, in order to formulate and support morally good solutions.”16 The definition recognizes the three dimensions of the issue—data, algorithms, and practice.
• The ethics of data refers to the ethical problems posed by the collection and analysis of large datasets and to issues arising from the use of big data in a diverse set of applications.
• The ethics of algorithms addresses concerns arising from the increasing complexity and autonomy of algorithms, their fairness, bias, equity, validity, and reliability.18
• The ethics of practices addresses questions concerning the responsibilities and liabilities of people and organizations in charge of data processes, strategies, and policies.
The growing research in AI ethics tackles many of these issues.

Perhaps one of the most important concepts in data-science ethics is informed consent. Participants of data-science projects should have full information about the project, its objectives, and scope, and they should freely agree to participate. If data about participants is collected, they should have full knowledge of what is being collected and how it will be used (including by third parties) so they can agree to its collection and use.

One important issue in data-science ethics that has received significant attention is bias. The Oxford English Dictionary defines bias as the “inclination or prejudice for or against one person or group, especially in a way considered to be unfair.” Bias is inherent in human activities and decision making, and human biases are reflected in data science as biases in data and biases in algorithms. Bias in data can be introduced through what is included in the historical data used by the algorithms—for example, arrest records in the U.S. include more marginalized communities, primarily because they are over-policed. Bias in data may also be introduced due to under-representation—for example, data used in face-recognition systems comprises 80% whites, of whom three-quarters are male. Algorithmic bias can occur due to the inclusion or omission of features in the algorithm/model. In ML deployments, this can occur during feature engineering. These features include individual attributes such as race, religion, and gender. The use of proxy metrics (for instance, using standardized examination scores to predict student success) can also lead to bias.

Although considerable attention has focused on the problem of bias, data-science ethics should be considered more generally consistent with the previously offered, broader definition of ethics. Some of the broader ethical considerations have overlapping concerns with data protection. Related to ethics of practice, societal norms that usually get coded in legislation are important (more on this in the next section). Consequently, while some of the ethical concerns are universal, others can be specific to a jurisdiction.

Broader ethical concerns also include: ownership of data; transparency, which refers to subjects knowing what data is collected about them and how it will be stored and processed (including informed consent by the subject); privacy of personal data, in particular the revealing of personally identifiable information; and intention regarding how data will be used, especially for secondary use.

Figure 3. Data-science life cycle. Stages shown in a cycle: Research Question; Data Preparation; Data Storage and Management; Data Analysis; Deployment and Dissemination. Surrounding layers: Data Protection; Data Ethics, Social and Policy Issues.

Social and policy context. As noted earlier, data-science deployments are highly sensitive to the societal and policy contexts in which they are deployed. For example, what can be done with data differs in different jurisdictions. The context can be legal, establishing legal norms for data-science deployments, or it can be societal, identifying what is socially acceptable. Furthermore, there are significant intersections between social science and humanities and the core issues in data science. There are four central concerns: ownership, representation, regulation, and public policy. Obviously, there is overlap between these and the data-ethics concerns previously discussed.

Ownership. Data ownership, access, and use—particularly in terms of how individual data is generated, who owns and can access it, and who profits from it—is a critical concern. At the societal and organizational levels, researchers analyze how economic systems are increasingly data-dependent in terms of both operations and revenue streams, and how pressures to collect and share ever-more intimate data may conflict with users’ own calls for privacy and autonomy. Data privacy from a technical perspective was previously discussed, but it obviously has a significant legal and social dimension that requires careful study (for example, Solove35).

Representation. A primary concern in the development of data-science technologies is ensuring diverse and equitable representation at all stages of the life cycle. This includes the evaluation of the training, tools, and techniques used in data science, including who designs them, who has access to them, and who is represented by them. Data representativeness is tightly tied to questions of marginalization and bias that appear throughout the design, data collection, analysis, and implementation of these technologies—for example, Richardson et al.32 Another concern is how data is increasingly used to “speak” for users, often without their knowledge—changing individuals’ relationships with their local communities, corporations, and state.

Regulation and accountability. Components of ethical data science also include a commitment to transparency and explainability in terms of analyzing the inputs of data-driven decision making and the algorithms applied, and determining how they lead to specific outputs and recommendations. Ensuring that the benefits and opportunities afforded by advances in data science equally benefit broader society involves accountability and regulatory practices at each stage of the data-science life cycle. These practices are not only centered on laws and policy interventions, such as the General Data Protection Regulation (GDPR) and the Canadian Personal Information Protection and Electronic Documents Act (PIPEDA). They must also include efforts to include values-in-design, interventions for more accessible and inclusive design, and tools for ethical thought at the levels of training, education, and ongoing daily practice.

Public policy. There is a critical and urgent need to integrate data science into the analysis of public policy.36 In an age where every Facebook, Twitter, and Instagram post is a data observation that can be archived and can become part of a historical dataset that can inform public policy, [...] and analyze this data. With the necessary tools, this data could be managed and analyzed in a way that is explainable and that can be disseminated in a meaningful way to provide key insights. In a similar vein, there is a paradox in the large amount of “open” data that remains unused, along with concerns of a data deficit in terms of more information that could and should be collected to inform public policy.

Pull quote: Who “owns” data science is a topic of some discussion, primarily between statisticians and computer scientists.

Data-Science Life Cycle
The definition of data science previously given clearly identifies the process view of data science, namely that it consists of several processing stages, starting from data ingestion and eventually leading to better decisions, insights, and action. That process is called the data-science life cycle. The literature often refers to the data life cycle, focusing only on data processing. A good definition of the data life cycle is given by the U.S. National Science Foundation Working Group on the Emergence of Data Science,6 which identifies five linear stages: acquiring the data, cleaning it and preparing it for analysis, using the data through analysis, publishing the data and the methods used to analyze the data, and preserving/destroying the data according to policy. Variations of this data life cycle model emerge in various proposals, some predating the above formulation—for example, Agrawal et al.,2 Jagadish,22 and Stodden.37

This life cycle model and its variants give the impression that the entire process is linear and unidirectional. Real project development hardly works in a linear fashion. An alternative model that is more iterative with built-in feedback loops has been proposed in the CRoss-Industry Standard Process for Data Mining (CRISP-DM) model for data-mining projects.34 CRISP-DM places data at the center and specifies a cyclical life cycle that is iterative and may be repeated over the lifetime of the project. PPDAC26 is similar to CRISP-DM for statistical analysis tasks. Microsoft’s Team Data Science Process life cycle39 also emphasizes the iterative nature of the process.
governments have been left behind The data-science life cycle proposed
in their ability to collect, aggregate, in this article (see Figure 3) derives

from and is built on those iterative models. It starts with the specification of the research question, which may come from a particular application or may be an exploratory question. A good understanding of the research question is important, since it normally drives the entire process. The next step is data preparation, which includes determining which datasets are needed and available; selecting the appropriate datasets from within this larger set; ingesting the data; and addressing data-quality issues, including data cleaning and data provenance. The third step is the proper storage and management of the data, including big-data management. Specifically, data needs to be integrated, appropriate storage structures need to be chosen and designed for efficient access, suitable access interfaces must be specified, and provisions need to be made for metadata management, in particular for provenance data. The prepared and suitably stored data is then open for analysis. In particular, the appropriate statistical and ML models are selected or developed, feature engineering is performed to identify the most appropriate model parameters, and model validation studies are conducted to determine the model’s suitability. If model validation is successful, the next step is deployment and dissemination, which involves different activities depending on the particular project and application. In some cases, the analysis and processing of data needs to be performed on a continuing basis, so deployment involves maintaining and monitoring the system over time. In other cases, deployment may involve compilation and dissemination of analysis results and their explanation. Dissemination of the analysis results, and sometimes even the curated data, is an important aspect of this phase. Open data, which is data that “anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness),”b is an important part of dissemination. Many governments and private institutions are adopting Open Data Principles stating that data should not only be open but should be complete, accurate, primary, and released in a timely manner. These properties make this data very valuable to data scientists, journalists, and the public. When open data is used effectively, data scientists can explore and analyze public resources, which allows them to question public policy, create new knowledge and services, and discover new value for social, scientific, or business initiatives.

b See http://opendefinition.org.

Problems during the analysis phase may result in the process returning to either reformulating the research question (it may be underspecified or overspecified, making model building infeasible) or cycling back to data preparation if the model requires other or different data that has not been prepared. As noted earlier, data-science deployments are not “one-and-done.” Following deployment, there must be constant monitoring: perhaps the environment changes, the data changes, or there is a deeper understanding of the research question that results in its revision and improvement. Thus, the process cycles as a dialectic process; every time the process comes back to the research question, we are at an elevated understanding of what needs to be studied. It is important to recognize that the stages in the life cycle are not isolated; the boundaries between stages are fuzzy, and there are important and interesting issues that arise at their intersections.

There is a continuous bi-directional interaction between this life cycle and the data-protection issues. Similarly, the ethical concerns, social norms, and policy framework impact each of the phases, sometimes even preventing the initiation of specific data-science studies.

Data Science Is Interdisciplinary

Who “owns” data science is a topic of some discussion, primarily between statisticians and computer scientists. This discussion bleeds into the question of who data scientists are, which then leads to different educational models of data science. Given the centrality of data to both disciplines, this discussion is perhaps not
surprising. The concern among statisticians regarding the field of data science is long-standing. Given the early promotion of data analytics as an important topic by Tukey, there is a strong feeling among statisticians that they own (or should own) the topic. In a 2013 opinion piece, Davidian13 laments the absence of statisticians in a multi-institution data-science initiative and asks if data science is not what statisticians do. She indicates that data science is “described as a blend of computer science, mathematics, data visualization, machine learning, distributed data management – and statistics,” almost disappointed that these disciplines are involved along with statistics. Similarly, Donoho laments the current popular interest in data science, indicating that most statisticians view new data science programs as “cultural appropriation.”14

There is a well-known argument put forth by Conway10 on the nature of data science. He proposes three main areas organized as a Venn diagram: hacking skills, mathematics and statistics, and substantive experience. The hacking skills he contends to be important are the ability “to manipulate text files at the command-line, understanding vectorized operations, thinking algorithmically.” Mathematics and statistics knowledge, at the level of “knowing what an ordinary least squares regression is and how to interpret it,” is required to analyze data. The substantive experience is about the research problem, which may come from an application domain or a specific research project. The Conway diagram, as it has come to be known, has become popular in these ownership debates among those who do not see a central role for computer science in data science, because Conway argues that hacking skills have nothing to do with computer science: “This, however, does not require a background in computer science – in fact, many of the most impressive hackers I have met never took a single CS course.”

The computer science view espouses instead the centrality of computing. One such view has been championed by Ullman,42 who also uses a Venn diagram to express his viewpoint and counters Conway, since he considers “algorithms and techniques for processing large-scale data efficiently as the center of data science.” Ullman claims that the two big knowledge bases of data science are computer science and domain science (that is, the application domain), and their intersection is where data science resides. He sees, rightly, ML as part of computer science. He argues, again rightly, that some of ML is used for data science, but there are ML applications that are outside of data science. His diagram shows that there are aspects of data science that require computer-science techniques that have nothing to do with machine learning; data engineering, as previously discussed, would fall in that category. I suspect that some of these points may not be controversial. Where the argument is likely to be challenged is that, in his view, mathematics and statistics “do not really impact domain sciences directly” despite their importance in computer science. Within computer science, there is the discussion regarding the relationship of AI/ML with data science, with some indicating that data science is part of AI, which was addressed in this article’s first section.

A more balanced view has been put forth by Marina Vogt,c who places data science at the intersection of computer science, mathematics and statistics, and domain knowledge. This is more in line with the view put forward by ACM in its curriculum proposal: “Data science is an interdisciplinary endeavor between computer science, mathematics, statistics, and applied areas such as natural sciences.”1 However, even this viewpoint is very STEM-centric and leaves out many topics that are of interest to data science as a field.

c The original article can no longer be found, but Vogt’s viewpoint can be found here: http://www.policyhub.net/node/212.

These discussions and the resulting controversies are not helpful or necessary; they do not move the data-science agenda forward. No single community “owns” data science: it is too big, the challenges are too great, and it requires involvement from many disciplines. Creation and use of knowledge are fundamental human activities that span millennia. This activity is at the core of what we can define as being our collective human culture. Attempts to splinter the core of human achievement through an attribute of ownership are, at best, a narrow parochial view and, at worst, an ascendancy of self-interest and greed.

[Figure 4. Unifying view of data science. Labels in the diagram: Computing; Machine/Statistical Learning; Mathematical Optimization; Application Domain Expertise; Humanities; Social Sciences; Law; Data Science.]

Data science should be viewed as a unifying force that connects several fields (see Figure 4), some of which are STEM and some are not. I go back to my discussion about stakeholders, who are diverse. Ownership arguments take place within one stakeholder group: STEM people who focus on foundational techniques and the underlying principles. Within this group, it is important to recognize and accept that there are communities with complementary and sometimes overlapping interests: computer scientists who bring expertise in computational techniques/tools that can effectively deal with scale and heterogeneity, statisticians who focus on statistical modeling for analysis, and mathematicians who have much to contribute with discrete and continuous optimization techniques and precise modeling of processes. However, this is only one stakeholder group; I have identified two others. One danger in such a unifying view is not finding the right balance between inclusiveness in accepting the contributions of all these fields and identifying the core of data science. I believe arguments made earlier in this article have established the core, so this danger is averted.
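
As a compact recap of the life cycle proposed earlier in this article (research question, data preparation, storage and management, analysis, deployment and dissemination), the stages and their feedback paths can be written down as a small state machine. The Python sketch below is illustrative only and makes several assumptions: the names `Stage`, `FORWARD`, `FEEDBACK`, `allowed_next`, and `is_valid_path` are hypothetical and do not come from the article, and the transition table encodes only the loops the text names explicitly (analysis back to the research question or to data preparation, and post-deployment monitoring back to the research question).

```python
from enum import Enum


class Stage(Enum):
    """Stages of the life cycle, in the order the article introduces them."""
    RESEARCH_QUESTION = "research question"
    DATA_PREPARATION = "data preparation"
    STORAGE_MANAGEMENT = "data storage and management"
    ANALYSIS = "analysis"
    DEPLOYMENT = "deployment and dissemination"


# Forward edges: the nominal progression from one stage to the next.
FORWARD = {
    Stage.RESEARCH_QUESTION: Stage.DATA_PREPARATION,
    Stage.DATA_PREPARATION: Stage.STORAGE_MANAGEMENT,
    Stage.STORAGE_MANAGEMENT: Stage.ANALYSIS,
    Stage.ANALYSIS: Stage.DEPLOYMENT,
}

# Feedback edges (an assumption: only the loops mentioned in the text).
# Problems during analysis send the process back to the research question
# or to data preparation; monitoring after deployment may lead to a
# revised research question.
FEEDBACK = {
    Stage.ANALYSIS: {Stage.RESEARCH_QUESTION, Stage.DATA_PREPARATION},
    Stage.DEPLOYMENT: {Stage.RESEARCH_QUESTION},
}


def allowed_next(stage):
    """Return the set of stages reachable in one step from `stage`."""
    reachable = set(FEEDBACK.get(stage, set()))
    if stage in FORWARD:
        reachable.add(FORWARD[stage])
    return reachable


def is_valid_path(path):
    """Check that a project history starts at the research question
    and uses only forward or feedback transitions."""
    if not path or path[0] is not Stage.RESEARCH_QUESTION:
        return False
    return all(b in allowed_next(a) for a, b in zip(path, path[1:]))
```

A project history that cycles back from analysis to data preparation and later returns to the research question after deployment passes `is_valid_path`, whereas a history that jumps from the research question straight to analysis does not. A strictly linear model would correspond to dropping `FEEDBACK` altogether, which is the limitation the iterative models are meant to address.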

Conclusion

Despite its recent popularity, the field of data science is still in its infancy and much work needs to be done to scope and position it properly. The success of early data-science applications is evident, from health sciences, where social-network analytics have enabled the tracking of epidemics; to financial systems, where guidance of investment decisions is based on the analysis of large volumes of data; to the customer care industry, where advances in speech recognition have led to the development of chatbots for customer service. However, these advances only hint at what is possible; the full impact of data science has yet to be realized. Significant improvements are required in fundamental aspects of data science and in the development of integrated processes that turn data into insight. Current developments tend to be isolated to subfields of data science and do not consider the entire scoping as discussed in this article. This siloing is significantly impeding large-scale advances. As a result, the capacity for data-science applications to incorporate new foundational technologies is lagging.

The objective of this article is to lay out a systematic view of the data-science field and to reinforce the key takeaways: It is important to clearly establish a consistent and inclusive view of the entire field; it is essential to define the core of the field while being inclusive to avoid becoming a catch-all or whatever the particular circumstances allow; it is critical to take a holistic view of activities that comprise data science; and a framework needs to be established to facilitate cooperation and collaboration among a number of disciplines.

Acknowledgments

A preliminary version of the ideas in this article appeared in an opinion piece in a 2020 bulletin of the IEEE Technical Committee on Data Engineering (43, 3, 3–11). My views on data science were sharpened in discussions with many colleagues as we worked on several data-science proposals. I thank colleagues (too many to list individually) who participated in these initiatives and taught me different aspects of the issues; they will see their fingerprints in the text of this article. I would especially like to acknowledge the many early discussions on framing the field and the relevant joint work with Raymond Ng and Nancy Reid. I thank Samer Al-Kiswani, Angela Bonifati, Khuzaima Daudjee, Maura Grossman, John Hirdes, Florian Kerschbaum, Jatin Matwani, Renée Miller, and Patrick Valduriez for feedback on all or part of this article. I very much appreciate the feedback from the anonymous reviewers, who pointed to weaknesses in some of my arguments and challenged me to clarify others. These helped improve the article.

References

1. ACM Data Science Task Force. Computing competencies for undergraduate data science curricula. Association for Computing Machinery (January 2021); https://bit.ly/3Vs7z3M.
2. Agrawal, D. et al. Challenges and opportunities with big data. The Computing Community Consortium of the CRA (2012); http://bit.ly/3UG0EDN.
3. Ahalt, S.C. Why data science? The National Consortium for Data Science (October 2013); http://bit.ly/3KwC0AV.
4. Alvargonzález, D. Multidisciplinarity, interdisciplinarity, transdisciplinarity, and the sciences. Intern. Studies in the Philosophy of Science 25, 4 (2011), 387–403; http://bit.ly/3zRMBBD.
5. A new paradigm for business of data. World Economic Forum; https://bit.ly/3VupVBf.
6. Berman, F. et al. Realizing the potential of data science. Communications of the ACM 61, 4 (2018), 67–72; http://bit.ly/3Uvu8nS.
7. Bertino, E. and Ferrari, E. Big data security and privacy. In A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years, S. Flesca, S. Greco, E. Masciari, and D. Saccà (eds.). Springer International Publishing (2018), 425–439; https://bit.ly/3obRJhh.
8. Chakravaram, V. et al. The role of big data, data science and data analytics in financial engineering. In Proceedings of the 2019 Intern. Conf. on Big Data Engineering, Association for Computing Machinery, 44–50; http://bit.ly/3oawNas.
9. Choi, B.C.K. and Pak, A.W.P. Multidisciplinarity, interdisciplinarity and transdisciplinarity in health research, services, education and policy: 1. Definitions, objectives, and evidence of effectiveness. Clinical and Investigative Medicine 29, 6 (2006), 351–364.
10. Conway, D. The data science Venn diagram. DrewConway.com (2015); http://bit.ly/3mtgtkF.
11. Data Science Applied to Sustainability Analysis. J. Dunn and P. Balaprakash (eds.), Elsevier (2021); http://bit.ly/414PQSd.
12. Data Science for Healthcare. S. Consoli, D.R. Recupero, and M. Petković (eds.), Springer (2019); http://bit.ly/3MY7Old.
13. Davidian, M. Aren’t we data science? Magazine of American Statistical Association (2013); http://bit.ly/413AwVH.
14. Donoho, D. 50 years of data science. J. Computational and Graphical Statistics 26, 4 (2017), 745–766; http://bit.ly/3zU2qrh.
15. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. From data mining to knowledge discovery in databases. AI Magazine 17, 3 (March 1996), 37–54; https://bit.ly/4233xSa.
16. Floridi, L. and Taddeo, M. What is data ethics? Philosophical Trans. of the Royal Society A: Mathematical, Physical and Engineering Sciences 374, 2083 (2016); https://bit.ly/3LxA5MS.
17. Getoor, L. et al. Computing research and the emerging field of data science. Computing Research Association (2016); https://bit.ly/3Vwz7oH.
18. Grimm, P.W., Grossman, M.R., and Cormack, G.V. Artificial intelligence as evidence. J. Technology and Intellectual Property 9, 1 (2021), Article 2; https://bit.ly/3nMPYax.
19. Hey, T., Tansley, S., and Tolle, K. The fourth paradigm: Data-intensive scientific discovery. Microsoft Research (October 2009); https://bit.ly/3nnpxYE.
20. IFIP Guide to Concepts and Terms in Data Processing Volume 1. Intern. Federation for Information Processing, North-Holland Publishing Company (1971); https://bit.ly/3LObrZS.
21. Ilyas, I.F. and Chu, X. Data Cleaning. Association for Computing Machinery, New York, NY, USA (2019); https://bit.ly/3LMnaYN.
22. Jagadish, H.V. Big data and science: Myths and reality. Big Data Research 2, 2 (2015), 49–52; https://bit.ly/44lxEFL.
23. Johnson, G. The world: In silica fertilization; All science is computer science. The New York Times (March 25, 2001).
24. Kelleher, J.D. and Tierney, B. Data Science. MIT Press, Cambridge, Mass. (2018).
25. Machine Learning and Data Science in the Power Generation Industry. P. Bangert (ed.), Elsevier (2021); http://bit.ly/4174sAg.
26. MacKay, R.J. and Oldford, R.W. Scientific method, statistical method and the speed of light. Statistical Science 15, 3 (2000), 254–278; https://bit.ly/41W5m3e.
27. Maydon, T. The 4 Types of Data Analytics (2017); https://bit.ly/42jIeLA.
28. Milligan, I. History in the Age of Abundance? How the Web is Transforming Historical Research. McGill University Press, Montreal (2019).
29. Moura, J. and Serrão, C. Security and privacy issues of big data. In Cloud Security: Concepts, Methodologies, Tools, and Applications, IGI Global (April 2019), 1598–1630; https://bit.ly/3LvfLM6.
30. Naur, P. Concise Survey of Computer Methods. Petrocelli Books (1974); https://bit.ly/420GdEs.
31. Özsu, M.T. and Valduriez, P. Principles of Distributed Database Systems (4th Edition), Springer (2020).
32. Richardson, R., Schultz, J.M., and Crawford, K. Dirty data, bad predictions: How civil rights violations impact police data, predictive policing systems, and justice. New York University Law Review Online 95, 15 (2019), 15–55; https://bit.ly/3NAdeTx.
33. Sarker, I.H. Smart city data science: Towards data-driven smart cities with open research issues. Internet of Things 19 (2022), 100528; https://bit.ly/40S0I4Q.
34. Shearer, C. The CRISP-DM model: The new blueprint for data mining. J. Data Warehousing 5 (2000), 13–22.
35. Solove, D.J. A taxonomy of privacy. University of Pennsylvania Law Review 154, 3 (2006), 477–560.
36. Steif, K. Public Policy Analytics: Code and Context for Data Science in Government. CRC Press (2021); https://bit.ly/413JgKB.
37. Stodden, V. The data science life cycle: A disciplined approach to advancing data science as a science. Communications of the ACM 63, 7 (2020), 58–66; https://bit.ly/42j3DEY.
38. Supriya, P. et al. Trends and application of data science in bioinformatics. In Trends of Data Science and Applications: Theory and Practices, S.S. Rautaray, P. Pemmaraju, and H. Mohanty (eds.), Springer Singapore (2021), 227–244; https://bit.ly/42cpiym.
39. Tabladillo, M. et al. The team data science process life cycle. Microsoft Learn (2022); https://bit.ly/4107xBt.
40. The world’s most valuable resource is no longer oil, but data. The Economist (May 2017).
41. Tukey, J.W. The future of data analysis. The Annals of Mathematical Statistics 33 (1962), 1–67.
42. Ullman, J.D. The battle for data science. IEEE Data Engineering Bulletin 43, 2 (2020), 8–14; https://bit.ly/3p3RzJr.

M. Tamer Özsu (tamer.ozsu@uwaterloo.ca) is a professor at the Cheriton School of Computer Science, University of Waterloo, Ontario, Canada.

Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Watch the author discuss this work in the exclusive Communications video. https://cacm.acm.org/videos/data-science-systematic-treatment
