
Compliments of

Getting Your Data Ready for AI
Governing Principles for Fast Self-Service Data Preparation

Kate Shoup

Beijing  Boston  Farnham  Sebastopol  Tokyo


Getting Your Data Ready for AI
by Kate Shoup
Copyright © 2019 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://oreilly.com/safari). For more
information, contact our corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Melanie Yarbrough
Copyeditor: Octal Publishing Services, Inc.
Proofreader: Rachel Head
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

November 2018: First Edition

Revision History for the First Edition


2018-11-28: First Release

This work is part of a collaboration between O’Reilly and IBM. See our statement of
editorial independence.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Getting Your Data
Ready for AI, the cover image, and related trade dress are trademarks of O’Reilly
Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-04239-6
[LSI]
Table of Contents

Getting Your Data Ready for AI
    Abstract
    The State of Artificial Intelligence
    Impediments to Growth
    Data Science: An Overview
    A Typical AI Workflow
    The Data-Wrangling Bottleneck
    Solutions
    Combining Data Catalogs and Data Science Tools: Watson Studio
    Conclusion
Getting Your Data Ready for AI

Abstract

This report briefly discusses an aspect of artificial intelligence (AI) and data science that is critical but rarely addressed: data preparation. The report first provides a brief overview of the disciplines of AI and data science. It then outlines a typical AI workflow, with a focus on the phases involved in data wrangling, before defining the challenges associated with those phases. Finally, it presents various solutions to these challenges, with special emphasis on IBM Watson Studio. Along the way, you will gain insights into the various types of data used for AI, the different flavors of AI (including machine learning), and possible long-term ramifications of the growing use of these technologies.

The State of Artificial Intelligence

The development of ever-faster computers, improvements in AI algorithms, and the exponential increase in available data—most notably unstructured data like audio, video, and photos, often referred to as big data—have led to an increased interest in AI, particularly in the business sphere. Indeed, according to a 2017 report issued by The Boston Consulting Group and MIT Sloan Management Review, “three-quarters of executives believe AI will enable their companies to move into new business” and “almost 85% believe AI will allow their companies to obtain or sustain a competitive advantage.”

This same report reveals something else, however: as interesting as executives find AI—and its various subdisciplines, including machine learning, in which machines become capable of learning without being explicitly programmed—few organizations have implemented this technology in any significant way. According to the report, “only about one in five companies has incorporated AI in some offerings or processes,” and “only one in 20 companies has extensively incorporated AI in offerings or processes.” Moreover, “less than 39% of all companies have an AI strategy in place.”

Impediments to Growth

One impediment to the growth of AI is the onerous process associated with developing AI models—a job performed by data scientists, who are precious resources indeed. Particularly burdensome are the phases of this process that involve accessing, labeling, and transforming data—also known as data wrangling or ETL, which is short for extract, transform, and load. Indeed, it is often reported that data scientists, whose job is to develop and deploy AI models, spend as much as 80% of their time on these phases. Perhaps worse (at least from the point of view of the data scientist), according to IBM Watson machine learning product manager Armand Ruiz, 57% of data scientists complain that data wrangling is the most tedious—and therefore their least favorite—part of their job.

The resulting bottleneck not only prevents data scientists from focusing on the parts of their job that they like and that yield real business value, but also slows the adoption and implementation of AI and the realization of its many tangible benefits, including accelerated research and discovery, enriched customer interactions, reduced operational costs, increased efficiency, higher revenue, and so on.

Data Is the Foundation of AI

“The foundation for AI is data,” writes IT expert William McKnight for TDWI. “Your data determines the depth of AI you can achieve…and its accuracy.” Innovator and entrepreneur Will Murphy agrees, describing data as the “oil that fuels AI.” Simply put, without good data—and lots of it—an AI system cannot succeed. This explains why the bottleneck around data wrangling is so problematic.


Data Science: An Overview

Data science expert Alex Castrounis defines data science as “an umbrella term that encompasses all the techniques and tools used during the life cycle of useful data to leverage existing data sources and create new ones as needed to extract meaningful information and actionable insights.” Using “business domain expertise, effective communication, computer science, and utilization of any and all relevant analytical and statistical techniques, data visualization, programming languages and packages, data infrastructure, and so on,” says Castrounis, data scientists discover, extract, condition, analyze, interpret, model, visualize, report on, and present data. Businesses employ data science to generate predictions, recommendations, and actionable insights; classify, rank, and score; detect patterns, groups, and anomalies; automate processes and decision making; optimize business practices; identify market segments; and provide voice and image recognition—all examples of AI in action.

Not surprisingly, the importance of data science, and of data scientists, has grown exponentially in recent years. Indeed, demand for the data scientist—which Thomas H. Davenport and D. J. Patil called “the sexiest job of the 21st century”1 in a 2012 Harvard Business Review article—is through the roof. Supply, however, is short. As Jay Limburn, distinguished engineer and director of product offering management at IBM, observed during a discussion with the author, “There’s frankly not enough data scientists in the world.” This shortage has become a critical constraint in some sectors—made worse by the labor-intensive nature of most data science workflows.

1 Davenport, Thomas H., and D. J. Patil. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90 (2012): 70–76.


Comparing Data Scientists and Data Analysts

Many confuse data scientists with data analysts—but they shouldn’t. “Data analysts,” Castrounis notes, “are often given questions and goals from the top down, perform the analysis, and then report their findings.” In contrast, data scientists seek to discover “which business goals are most important and how the data can be used to achieve certain goals,” and “tend to generate the questions themselves.” Furthermore, “data scientists typically leverage programming with specialized software packages and employ much more advanced statistics, analytics, and modeling techniques.”

Katharine Jarmul, coauthor of Data Wrangling with Python (O’Reilly), agrees. She contends that data analysis typically involves “trying to show overall trends or patterns in data or finding out how a particular group or subsample behaves.” In contrast, data science generally means using this data to “make predictions, build a language model, or build something a person might interact with.”

A Typical AI Workflow

When it comes to the workflow used by data scientists for AI projects, “there is some difficulty in describing ‘typical,’” says Jarmul. “The data that you’re working with and the algorithms or models that you are using can substantially affect what you need to do to get your data prepared.” Complicating matters, there are various flavors of AI systems, including supervised learning systems, unsupervised learning systems, and reinforcement learning systems. Regardless, most AI projects involve the following steps:

1. Connecting to data sources to access data
2. Labeling data
3. Transforming data
4. Building, training, and testing the model
5. Deploying the model
6. Monitoring, analyzing, and managing the model

Data scientists generally perform the first four steps of this process, whereas DevOps professionals typically handle steps 5 and 6.

As mentioned, this report focuses on the first three steps of the process: connecting to data sources to access data, labeling data, and transforming data (in other words, everything the data scientist has to do before they can do the fun stuff).

Connecting to Data Sources to Access Data

The first step in any AI workflow is some form of data discovery. This involves locating and accessing various types of data from both internal and external sources, as well as ascertaining its provenance to ensure that you are permitted to use it.

Types of data

There are three main types of data:

Structured data
    Structured data is data that can be (or already has been) easily organized into a spreadsheet or relational database. For example, data in a sales record is structured data.

Unstructured data
    Unstructured data is data that does not fit easily into a spreadsheet or relational database. Audio files, video files, and PDF files are examples of unstructured data.

Semi-structured data
    This type of data is essentially a hybrid of structured and unstructured data. In other words, it’s unstructured data that has structured data attached to it in the form of metadata. Examples of semi-structured data include comma-separated values (CSV) files and Twitter messages.

These distinctions are important, says Castrounis, “because they’re directly related to the type of database technologies and storage required, the software and methods by which the data is queried and processed, and the complexity of dealing with the data.”

In addition to these various types of data, there are different data formats—many of which might be incompatible. For example, data harvested from web pages might not play nicely with data pulled from mobile devices, which itself might tangle with data culled from an on-premises database, and so on.
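To make these categories concrete, here is a minimal Python sketch (the sample records are invented for illustration) contrasting structured rows, which slot directly into a table, with a semi-structured, tweet-like JSON document, where structured metadata wraps unstructured text:

```python
import csv
import io
import json

# Structured: rows with a fixed schema, ready for a relational table.
sales_csv = "order_id,amount\n1001,49.99\n1002,15.00\n"
rows = list(csv.DictReader(io.StringIO(sales_csv)))
print(rows[0]["amount"])    # every field is directly addressable

# Semi-structured: free text (the message) wrapped in structured metadata.
tweet = json.loads(
    '{"user": "acme", "created_at": "2018-11-28",'
    ' "text": "Our store opens tomorrow!"}'
)
print(tweet["created_at"])  # the metadata is structured...
print(tweet["text"])        # ...but the payload itself is not
```

Unstructured data (an audio or video file) has no such addressable fields at all, which is what makes it the hardest of the three to query and process.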

Data provenance

Before you can use any data that you deem relevant in your AI project, you must trace its provenance—where it came from, how it was collected, who collected it, and under what conditions. Data provenance, says Katharine Jarmul, “is incredibly important, because it’s going to inform your data science team how they should treat the data and what they can use from it.” It might be the case that, due to privacy concerns, some data is off limits. “You have to figure out from a legal standpoint and a utilitarian standpoint what data you can use for the problem you want to solve,” says Jarmul.

Labeling Data

“Any attempt to manage and organize information,” observe IBM’s Jay Limburn and Paul Taylor in a 2017 blog post, “depends on two things: data and metadata.” They explain: “The data is the information itself, while the metadata describes the information’s attributes, such as what structure it is stored in, where it is stored, how to find it, who created it, where it came from, and what it can be used for.” Applying this metadata—in other words, labeling the data—is a critical step in the data-prep workflow. You might also need to “look over the data and mark a particular thing that you’re trying to study,” explains Jarmul—often called the target variable.

Transforming Data

After you access and label data, the data usually goes through a series of transformations. These transformations might include removing noise, standardizing the data, and so on, depending on what type of model you want to build. Standardizing data essentially means fixing any disparities in the data. Often, you’ll be working with data from different types of sources—meaning it might not match up. It’s up to you to figure out how to accommodate these data disparities and to bring the data into some sort of cohesive model.
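For instance, standardizing dates that two sources record differently might look like the following sketch (the record layouts and field names are invented for illustration):

```python
from datetime import datetime

# Two hypothetical sources record the same signup date in different
# formats; standardizing maps both onto one canonical representation.
web_record = {"user": "a1", "signup": "11/28/2018"}     # US-style
mobile_record = {"user": "a1", "signup": "2018-11-28"}  # ISO 8601

def to_iso(date_str):
    """Try each known source format and return an ISO 8601 date."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(date_str, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {date_str}")

# After standardization the two records agree.
assert to_iso(web_record["signup"]) == to_iso(mobile_record["signup"])
```

The same pattern applies to units, category codes, and character encodings: pick one canonical form, then convert every source into it on the way in.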

The Data-Wrangling Bottleneck

Most people enter the field of data science because “they love the challenge of developing algorithms and building machine learning models that turn previously unusable data into valuable insight,” writes IBM’s Sonali Surange in a 2018 blog post. But these days, Surange notes, “most data scientists are spending up to 80 percent of their time sourcing and preparing data, leaving them very little time to focus on the more complex, interesting and valuable parts of their job.” (There’s that 80% figure again!)

This bottleneck in the data-wrangling phase exists for various reasons. One is the sheer volume of data that companies collect—complicated by limited means by which to locate that data later. As organizations “focus on data capture, storage, and processing,” write Limburn and Taylor, they “have too often overlooked concerns such as data findability, classification and governance.” In this scenario, “data goes in, but there’s no safe, reliable or easy way to find out what you’re looking for and get it out again.” Unfortunately, observes Jarmul, the burden of sifting through this so-called data lake often falls on the data science team.

Another reason for the data-wrangling bottleneck is the persistence of data silos. Data silos, writes AI expert Edd Wilder-James in a 2016 article for Harvard Business Review, are “isolated islands of data” that make it “prohibitively costly to extract data and put it to other uses.” Some data silos are the result of software incompatibilities—for example, when data for one department is stored on one system, and data for another department is stored on a different and incompatible system. Reconciling and integrating this data can be costly. Other data silos exist for political reasons. “Knowledge is power,” Wilder-James explains, “and groups within an organization become suspicious of others wanting to use their data.” This sense of proprietorship can undermine the interests of the organization as a whole. Finally, silos might develop because of concerns about data governance. For example, suppose that you have a dataset that might be of value to others in your organization but is sensitive in nature. Unless you know exactly who will use that data and for what, you’re more likely to cordon it off than to open it up to potential misuse.

In addition to prolonging the data-wrangling phase, the existence of data lakes and data silos can severely hamper your ability to locate the best possible data for an AI project. This will likely affect the quality of your model and, by extension, the quality of the broader organizational effort that your project is meant to support. For example, suppose that your company’s broader organizational effort is to improve customer engagement, and as part of that effort it has enlisted you to design a chatbot. “If you’ve built a model to power a chatbot and it’s working against data that’s not as good as the data your competitor is able to use in their chatbot,” says Limburn, “then their chatbot—and their customer engagement—is going to be better.”

Solutions

One way to ease the data-wrangling bottleneck is to try to address it up front. Katharine Jarmul champions this approach. “Suppose you have an application,” she explains, “and you’ve decided that you want to use activity on your application to figure out how to build a useful predictive model later on to predict what the user wants to do next. If you already know you’re going to collect this data, and you already know what you might use it for, you could work with your developers to figure out how you can create transformations as you ingest the data.” Jarmul calls this prescriptive data science, which stands in contrast to the much more common approach: reactionary data science.

Maybe it’s too late in the game for that. In that case, there are any number of data catalogs to help data scientists access and prepare data. A data catalog centralizes information about available data in one location, enabling users to access it in a self-service manner. “A good data catalog,” writes analytics expert Jen Underwood in a 2017 blog post, “serves as a searchable business glossary of data sources and common data definitions gathered from automated data discovery, classification, and cross-data source entity mapping.” According to a 2017 article by Gartner, “demand for data catalogs is soaring as organizations struggle to inventory distributed data assets to facilitate data monetization and conform to regulations.” Examples of data catalogs include the following:

• Microsoft Azure Data Catalog
• Alation Catalog
• Collibra Catalog
• Smart Data Catalog by Waterline
• Watson Knowledge Catalog

In addition to data catalogs to surface data for AI projects, there are several tools to facilitate other data-science tasks, including connecting to data sources to access data, labeling data, and transforming data. These include the following:

Database query tools
    Data scientists use tools such as SQL, Apache Hive, Apache Pig, Apache Drill, and Presto to access and, in some cases, transform data.

Programming languages and software libraries
    To access, label, and transform data, data scientists employ tools like R, Python, Spark, Scala, and pandas.

Notebooks
    These programming environments, which include Jupyter, IPython, knitr, RStudio, and R Markdown, also aid data scientists in accessing, labeling, and transforming data.

Combining Data Catalogs and Data Science Tools: Watson Studio

As helpful as the aforementioned data science tools can be, they share one critical limitation: they’re siloed. As IBM’s Armand Ruiz explains in a 2018 blog post, “Only highly technical professionals in IT [can] organize and make sense of the vast amounts of data,” and only domain experts, or subject matter experts, can “successfully convert data into the rich knowledge needed by AI.” The result, he says, is that “domain experts and IT professionals [work] in silos, with different tools and no visibility to each other’s work.” To solve this problem, IBM developed Watson Studio, which combines data catalogs, including Watson Knowledge Catalog, and support for various familiar data science tools, including R, Python, Scala, Jupyter notebooks, RStudio, and more, into one seamless environment.

To describe Watson Studio, IBM’s Jay Limburn invokes Amazon, observing that trying to do data science without Watson Studio would be a little like trying to find products on Amazon using the company’s backend inventory tools rather than its frontend shopping portal. “There wouldn’t be very good search,” he observes. “We wouldn’t have pictures of the products; we wouldn’t see how products are related, or who liked them, or who bought them…all this other stuff.” And yet, says Limburn, “that’s what we’ve been asking data scientists to do with data.” Watson Studio, with Watson Knowledge Catalog, represents “the storefront end around our data,” making it easier for users to find and “consume” it.

Simply put, “Watson Studio closes the gap with a unified experience to create new insights from knowledge contained in the data,” says Ruiz. Watson Studio—a cloud-based service—offers tools to aid in each step of the data science process: collecting, labeling, and transforming data; building, training, testing, and deploying models; and monitoring, managing, and analyzing models. With Watson Studio, data scientists, domain experts, and application developers can collaborate to build, train, and deploy AI models at scale.

To minimize the bottleneck associated with wrangling data, Watson Studio integrates two key tools:

• The aforementioned Watson Knowledge Catalog
• Data Refinery

Watson Knowledge Catalog

You can use Watson Knowledge Catalog to unearth data, models, and more, as well as to curate, categorize, and share data—all using a self-service platform. With Watson Knowledge Catalog, you can do the following:

Connect to data
    Thirty prebuilt data connectors allow users to establish connections with commonly used data sources, both on-premises and in the cloud. This functionality goes a long way toward unclogging the bottleneck associated with connecting to and accessing data. (See Figure 1-1 for a list of available data sources.)

Figure 1-1. Accessing data from multiple repositories in Watson Knowledge Catalog (image provided courtesy of IBM)

Discover data
    Watson Knowledge Catalog facilitates the discovery and ingestion of data by enabling users to search for the data that they need in a single, centralized portal (see Figure 1-2). A recommendation engine connects users with relevant data the same way Netflix “unlocks” new TV shows based on other shows you’ve watched, explains Limburn. (Ruiz calls this “Spotify for data.”) When you find data that you want to use, you click the “Add to Catalog” button to add it to your project dataset—like using the “Add to Cart” button you see on ecommerce sites like Amazon.

Figure 1-2. Data discovery in Watson Knowledge Catalog (image provided courtesy of IBM)

Classify data
    When you add a data asset to a project in Watson Knowledge Catalog, it is automatically indexed and classified, essentially automating the “labeling data” step in the data science workflow (see Figure 1-3). In addition, say Limburn and Taylor, “Users can add tags and comments to explain what information each dataset contains, and why it is useful.” They can also rate datasets using a star system.

Figure 1-3. Automatic classification of data in Watson Knowledge Catalog (image provided courtesy of IBM)

Govern data
    Watson Knowledge Catalog is “underpinned by an intelligent and robust governance framework that ensures its users comply with corporate data governance policies,” writes IBM’s Susanna Tai in a 2017 blog post. This framework allows for the secure sharing of data assets across the organization through the use of well-defined access control policies.

Data Governance with Watson Knowledge Catalog

Historically, observes Jay Limburn, “data governance has been about protecting and locking away data.” For example, consider a dataset with 20 columns, one of which contains credit card numbers. “In the old world, we’d lock the whole dataset away,” says Limburn—even if the other 19 columns contained extremely relevant data.

Watson Knowledge Catalog addresses this issue by enabling users to mask certain columns in a dataset from users who should not have access to them, while allowing those users to surface other columns in that same dataset for use in their models. Watson Knowledge Catalog achieves this functionality by using security roles. Limburn explains: “You would have your own custom view of data based on the policies that are defined for that data and on what roles you have been assigned.”


Data Refinery

Although Watson Knowledge Catalog can assist in assembling your dataset, another Watson Studio function, Data Refinery (see Figure 1-4), can be employed to clean it, removing incorrect, incomplete, duplicated, or improperly formatted data from the set. Data Refinery can also help to shape the data, by filtering and sorting it, combining or removing columns, and performing various data operations (see Figure 1-5). Finally, Data Refinery can validate data and produce interactive visualizations such as charts and graphs to help users identify hidden patterns, connections, and relationships. All this occurs within an intuitive graphical user interface. This enables even non–data scientists to prepare data, freeing data scientists to focus on the work they enjoy most and that adds the most value to the organization.

Figure 1-4. Data Refinery (image provided courtesy of IBM)

Figure 1-5. Interactive data visualizations in Data Refinery (image provided courtesy of IBM)

“Effectively,” says Sonali Surange of IBM, “Data Refinery acts as a sandbox where the user can experiment without risk.” She explains: “Instead of having to specify their requirements up front, the user can experiment with different data transformations in a free-flowing, iterative process—adding, removing, and re-ordering steps until they find the right ‘recipe’ to shape the data for future analysis.” The tool can also automatically suggest common functions that the user might want to apply.

Real-World Watson Studio Example

Suppose that you want to prepare a dataset of building violations in the city of Chicago to trigger an automated system that flags open cases for review.

Your first step is to assemble the dataset, which should consist of a series of records that define specific instances of building violations. These records contain various fields, such as VIOLATION_DATE, VIOLATION_LOCATION, VIOLATION_CODE, VIOLATION_DESCRIPTION, VIOLATION_STATUS, INSPECTOR_ID, and so on. To assemble this dataset, you use Watson Knowledge Catalog to search for appropriate data. When you find data that you want to include in the set, you simply click “Add to Project.”

After you assemble your dataset, you click “Refine” to view it in Data Refinery. Here, you can apply various predefined or manual operations to clean and shape the data. In this case, your first step will be to mask the INSPECTOR_ID column for privacy reasons by replacing the values in this column with random strings. You can do this in one of two ways: by using Data Refinery’s Operation menu or by using R to handcode it. Let’s try the first approach:

1. Select the INSPECTOR_ID column.
2. Click the Operation button.
3. In the menu that appears, select the Substitute operation.

Next, you want to filter the dataset to show open records only. Again, you can do this by way of the Operation menu or by coding it yourself. In this case, let’s go the second route:

1. Click the “Code an Operation” field next to the Operation button.
2. In the menu that opens, select Filter.
3. In the R string that appears, click the Column placeholder, and then choose VIOLATION_STATUS.
4. Click the logicalOperator placeholder, and then select ==.
5. Select the provide_value text, and then type OPEN.
6. Click the Apply button.
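Data Refinery expresses these operations as R code, but the logic of the two steps—substituting random strings for INSPECTOR_ID and filtering on VIOLATION_STATUS—can be sketched in plain Python. The records below are invented; only the field names come from the dataset described above:

```python
import random
import string

# Invented building-violation records mirroring the fields above.
records = [
    {"VIOLATION_CODE": "CN065014", "VIOLATION_STATUS": "OPEN",
     "INSPECTOR_ID": "BL00123"},
    {"VIOLATION_CODE": "CN198019", "VIOLATION_STATUS": "CLOSED",
     "INSPECTOR_ID": "BL00456"},
]

def mask(value, rng=random.Random(0)):
    # Substitute: replace the real ID with a random string of equal length.
    return "".join(rng.choice(string.ascii_uppercase) for _ in value)

prepared = [
    {**r, "INSPECTOR_ID": mask(r["INSPECTOR_ID"])}
    for r in records
    if r["VIOLATION_STATUS"] == "OPEN"  # Filter: keep open records only
]
print(len(prepared))  # only the OPEN record survives the filter
```

The point of the sketch is the ordering: masking happens per record, while the filter decides which records reach the output at all—exactly the sequence of operations queued up in Data Refinery.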

After you enter all the operations you want to apply to your dataset, click Run to execute the operations in order. You can choose to write the results back to your database or save them in the format of your choice. As Data Refinery cleans and shapes your data, you can continue to monitor and analyze its progress in the Control Panel. (You can also schedule new or existing runs here.) Alternatively, you can continue working in Watson Studio.

Innate Characteristics of Watson Studio

Specific tools and features accessible from within Watson Studio, such as Watson Knowledge Catalog and Data Refinery, assist data scientists in the laborious process of data preparation. But that’s not all. Innate characteristics of the software also facilitate this work. For example, Watson Studio is:

Self-service
    It used to be, says Ruiz, that “advanced data preparation was only available through the use of very skilled data scientists.” That’s changed. Tools like Watson Studio allow people in various roles—such as domain experts, developers, or business analysts—to do data science. This in turn frees up data scientists to focus on the more engaging and valuable parts of their job. It also, says Sonali Surange, “brings the right users closer to the data.” As she explains, together with data scientists, “business analysts and other line-of-business users are often the people who have the best operational understanding of the data, so they are in the best position to prepare and shape it for productive analysis.” People in these roles “are also much more likely to be able to identify, diagnose and remediate data quality issues early in the process, potentially saving hours of wasted effort further down the line.” As an added benefit, “by helping these users work with data sources more independently, self-service data preparation also avoids bottlenecks between teams, reducing the impact of competing priorities on business goals and deadlines.”

Collaborative
    “We are convinced,” says Ruiz, “after working with clients around the world, that rich collaboration is key [to] unlocking the full potential of AI.” Watson Studio can help facilitate this collaboration. Indeed, different teams can work together “even if they use different technologies,” observes Paul Taylor. Members of these disparate communities, says Taylor, “can use the special technologies they prefer, but not have to then copy the data, to be able to see each other’s work, and see the comments, and be able to annotate it and collaborate on it, and maybe invite other people into their projects.” If you need to work across organizations, Taylor adds, having Watson Studio on the cloud “makes it easier to do that type of collaboration.”

Agile
    Watson Studio being a cloud-based service, says Taylor, “means we’re delivering changes—pretty much every day in some cases—into that system.” And this means users “keep getting more capabilities.”

Secure
    IT professionals commonly assert that cloud services are more secure than on-premises applications and data, and Watson Studio is no exception. Data in Watson Studio is encrypted at rest and in motion, disaster resilient, and GDPR-compliant. (For more on IBM cloud security, see IBM’s cloud security web page.)

Use Cases for Watson Studio

There are countless use cases for AI, including in the automotive, manufacturing, retail, finance, agriculture, energy, healthcare, pharmaceuticals, media, telecom, transport, and other industries. In contrast, the use case for Watson Studio “is not really tied to any specific industry,” says Jay Limburn. Rather, the use case is, “I need to build a model. How do I get started?” Simply put, “It’s really all about finding information quickly, being able to wrangle it, shape it, get it in the format I need it, and then do something productive with it through data science.”

Conclusion

If what the experts say is true—that AI represents the next wave of digital disruption; that its impact will rival that of earlier general-purpose technologies like the steam engine, electricity, and the internal combustion engine; that, in the words of Google CEO Sundar Pichai, it will be “more important than humanity’s mastery of fire or electricity”—it follows that organizations that effectively employ AI will enjoy a critical advantage over organizations that don’t. And yet, at present, relatively few organizations do this—in part because of problems posed by wrangling data.

That’s where self-service data science tools like Watson Studio come in. These tools help to eliminate the bottleneck associated with data wrangling. This not only frees data scientists—who are in limited supply—to focus on the parts of their jobs that bring more value, but might also hasten the widespread adoption of AI. When this happens, expect early adopters to enjoy a significant advantage over firms that lagged behind.


About the Author

Kate Shoup is a freelance writer and editor who has written, coauthored, or ghost-authored more than 50 books on a variety of topics. Titles include iPod and iTunes Visual Quick Tips, iPhone Visual Quick Tips, Windows 7 Digital Classroom, Teach Yourself Visually Microsoft Office 2010, and Laptops Simplified (all from John Wiley & Sons). She also handles various corporate writing tasks, including developing website copy and other marketing materials, and composing a corporate history for a multinational corporation.
