DataCataloging Ebook Final 3 PDF
Businesses need to maximize the value of their data to drive monetization and increase what McKinsey & Company refers to as “the insights to value chain.” [1] In many cases this includes leveraging artificial intelligence (AI) that can fuel predictive insights and proactive outcomes. However, the growing volume of data spread across multiple deployments, together with the internal obstacles of traditional manual processes and data stewardship roles, remains a challenge. Leaders are discovering that their current data processes don’t scale efficiently to meet today’s needs, let alone the ones they will face in the future, yet finding a solution is imperative.

Gartner estimates that by 2021, AI augmentation (a human-centered partnership model between people and AI technologies working together) will create a business value of $2.9 trillion and 6.2 billion hours of heightened worker productivity worldwide. [2]

As a solution, many organizations have begun to implement DataOps (data operations) practices to deliver continuous enterprise data that is high-quality and trustworthy. DataOps orchestrates people, process, and technology to solve the challenges associated with inefficiencies in accessing, preparing, and integrating data. This enables collaboration across an organization to drive agility, speed and new initiatives at scale.

At the heart of an effective DataOps practice is a data catalog, a metadata management tool designed to help organizations find and manage large amounts of data. It puts trusted data in the hands of a business by automating the organization of a common and known business vocabulary, self-service management of data and on-boarding of data content. This ebook focuses on the importance of a modern data catalog and the benefits a business can reap from its use when it’s implemented correctly. From supporting multicloud adoption and integration to accelerating an organization’s journey to AI, the data catalog is at the foundation.
Introduction
Gartner originally defined a data catalog as a tool that “creates and maintains an inventory of data assets through the discovery, description and organization of distributed data sets.” As the quantity of data available to organizations has grown exponentially over the last several years, data catalogs have grown in importance, and their definition and scope have grown as well. Delivering business-ready data to feed analytics and AI projects begins with a data catalog that can automate organization, provide consistent definitions and enable self-service management of enterprise data.

A modern data catalog allows data analysts to find all the data available in each database or application maintained by their organization. This can include both relational data and unstructured data, which can be found in Word documents or spreadsheets, whereas analytic assets will include Jupyter Notebooks, trained models and dashboards. Because data catalogs make data sources more discoverable and manageable, they help organizations make more informed decisions about how to use their data. How to access the data, the data format, the classification of the asset, the asset lineage and the list of collaborators that have access to certain kinds of data are the kinds of information that should be embedded inside data assets.
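As a rough illustration of the kind of information described above, the following Python sketch models a single catalog entry and a simple keyword search. The class name, field names, asset names and access paths are all hypothetical and are not taken from any particular catalog product; a real catalog would store far richer metadata.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one data or analytic asset."""
    name: str
    access_path: str      # how to access the data (illustrative path)
    data_format: str      # e.g. "relational", "spreadsheet", "notebook"
    classification: str   # e.g. "internal", "confidential"
    lineage: List[str] = field(default_factory=list)        # upstream assets
    collaborators: List[str] = field(default_factory=list)  # users with access
    tags: Set[str] = field(default_factory=set)


def search(entries: List[CatalogEntry], keyword: str,
           data_format: Optional[str] = None) -> List[str]:
    """Return names of entries whose name or tags contain the keyword,
    optionally restricted to one data format."""
    keyword = keyword.lower()
    hits = []
    for entry in entries:
        searchable = {entry.name.lower()} | {t.lower() for t in entry.tags}
        if any(keyword in item for item in searchable):
            if data_format is None or entry.data_format == data_format:
                hits.append(entry.name)
    return hits


# Two made-up assets: a relational table and a notebook derived from it.
catalog = [
    CatalogEntry("customer_orders", "db://sales/orders", "relational",
                 "confidential", lineage=["crm_export"],
                 collaborators=["ana", "raj"], tags={"sales", "customers"}),
    CatalogEntry("churn_model_notebook", "files://analytics/churn.ipynb",
                 "notebook", "internal", lineage=["customer_orders"],
                 collaborators=["ana"], tags={"customers", "model"}),
]
```

Calling `search(catalog, "customer")` returns both assets, while adding `data_format="notebook"` narrows the result to the notebook alone.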
The importance of a modern data catalog
Benefit from data discovery
A catalog helps alleviate manual processes and dependencies with advanced discovery capabilities typically driven by machine learning and semantic context. This makes it easier to find relevant assets quickly and at scale.

Ways in which a catalog enables data discovery include:
– Search keywords and filters based on subject tags and other asset properties
– Preview capabilities to ensure that you are selecting the correct data asset
– Reviews about assets created by collaborators within the catalog to help identify the best assets to pull from
– Asset recommendations that are automatically compiled based on your usage history, similar assets and other factors

Expedite data preparation
In order to help transform large amounts of raw data into consumable, quality information that’s ready for analysis, a data catalog should have self-service preparation features to support any data preparation solution your company already has in place.

Make sure the following features are included in your catalog to make it easy to explore, prepare and deliver data that can be trusted and used across your business.
– Powerful operations that clean, organize, fix and validate your data
– Scripting support for the efficient and flexible manipulation of data
– Scheduling and monitoring of data preparation flows
– Profiles for validating your data
– Visualizations for gaining insight into your data
– Policies that mask data are enforced
– Support for unstructured data

Collaborate across governed assets
Data catalogs must have a record of collaborators who need access to certain assets and corresponding information in data sets from across an entire organization, without needing separate credentials for every source. This creates a single platform where any member in an enterprise can locate their data. To ensure security, the data catalog assigns the correct roles to its users based on their needs and will place the necessary restrictions on what the user can and can’t do inside the catalog.

Types of collaborators and their functions:
– Authors: Subject-matter experts who will pull and draft the appropriate information into the catalog
– Approvers: Once authors have completed their draft, approvers can review, comment, approve, or deny the delivered information
– Publishers: Authorized to publish the approved information and make the new business terms and data assets available to anyone with access to the business glossary
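The author, approver and publisher roles can be thought of as a small state machine in which each role is allowed to make only certain transitions. The Python sketch below is one possible way to express that idea; the state names, the transition table and the `BusinessTerm` class are assumptions made for illustration and do not describe how any specific catalog implements its workflow.

```python
from enum import Enum


class Role(Enum):
    AUTHOR = "author"
    APPROVER = "approver"
    PUBLISHER = "publisher"


# Which role may move a term from one state to the next (illustrative rules).
TRANSITIONS = {
    ("draft", "submitted"): Role.AUTHOR,
    ("submitted", "approved"): Role.APPROVER,
    ("submitted", "draft"): Role.APPROVER,      # approver denies the draft
    ("approved", "published"): Role.PUBLISHER,
}


class BusinessTerm:
    """Hypothetical glossary term that moves through a review workflow."""

    def __init__(self, name):
        self.name = name
        self.state = "draft"

    def move_to(self, new_state, role):
        required = TRANSITIONS.get((self.state, new_state))
        if required is None:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        if role is not required:
            raise PermissionError(
                f"{role.value} cannot move a term to {new_state}")
        self.state = new_state


term = BusinessTerm("Personally identifiable information")
term.move_to("submitted", Role.AUTHOR)      # author finishes the draft
term.move_to("approved", Role.APPROVER)     # approver signs off
term.move_to("published", Role.PUBLISHER)   # publisher makes it available
```

Keeping the permitted transitions in a single table makes the restrictions on each role easy to review and change, which is the point of assigning roles in the first place.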
What it looks like when your business has a data catalog and it is implemented correctly:
What it looks like when your business doesn’t have a data catalog or it is implemented incorrectly:
Understanding the benefits of a modern data catalog is just the beginning. It’s equally important to understand how to start integrating it into your business to realize value faster. When the goal of your organization is to increase efficiency and collaboration across stakeholders, the first place to focus your improvements should be the company’s taxonomy. This will become the foundation for content categorization and data relationships, and it will provide a guideline that improves the speed at which data can be found, accessed or reused.
Setting up a business taxonomy
Step one: Focus on a single high-value information area
As opposed to trying to organize all of your assets at once, it is far more efficient to focus on a particular segment of the business that will drive the greatest impact. For instance, if compliance and regulatory processes, such as for GDPR and CCPA, are high priority for your organization, begin with establishing terms and classifying assets related to personally identifiable information.

Step two: Concentrate on the meaning of business definitions
Use the language of your industry in the form of logical or business intelligence models to power existing terms and standards already set in place. Take time to understand how certain concepts and definitions are currently being applied throughout your organization, then build your catalog specific to these key components, data types and common uses of data.

Step three: Establish benefit and gain interest
Though adoption of a business taxonomy might not happen overnight, it is critical for your organization to understand the advantage of having a single place where all information is stored. Within a specific sector of your business, champion the idea of selecting a focused area to start integrating a data catalog with an established business taxonomy, so the organization’s data can be consolidated in one place.

Step four: Develop and commit to milestones
The final step is to establish official milestones that your organization will commit to for implementation of the business categories, business terms, and correct assignment of user roles, and moreover for the data catalog process. Whether you have a mature DataOps culture in place or this is your first step, it is important to remember that each organization has unique needs where stakeholders in and out of IT need to add value to drive success of data projects.
Data governance officer: Identify focus areas; identify sponsors and key stakeholders; identify data stewardship team
Governance council: Gather potential terms to bootstrap new taxonomy; identify the category hierarchy; select the first set of terms to populate; agree to their definitions
An organization can leverage a data catalog to achieve the levels of success that enterprise data leaders are experiencing today. From ensuring that your enterprise can meet compliance regulations, to facilitating data lake governance, to cutting down on the time-consuming labor that it takes to govern your data, the following stories share the data struggles five different companies were able to overcome by implementing their own data catalog.
Data catalog use cases
Improve regulatory compliance
Ungoverned sensitive data may lead to regulatory penalties. For instance, if a business does not rectify its violations of the California Consumer Privacy Act, an attorney general could impose a civil penalty of anywhere from $2,500 to $7,500 per violation, [3] and when it comes to the GDPR, financial penalties could go as high as 20 million euros or 4% of worldwide annual turnover. [4] Therefore, as organizations face growing data privacy regulations, they must look more holistically at how they store and use data.

A data catalog can automate the classification and profiling of data assets and automatically enforce data protection rules established to anonymize and restrict access to sensitive information. More importantly, if something goes wrong, controls allow the organization to rapidly respond to an issue, whether that means flagging sensitive data, identifying and remediating issues, or collecting information in response to an audit.

The IBM Global Chief Data Office helps analyze and visualize business risks around sensitive data
Under the GDPR, companies in possession of personal data from European Union data subjects are legally obligated to understand the types of data they store, where the data lives and its associated levels of risk.

For a company as large as IBM, which operates in more than 170 countries, it can be a daunting task to refresh an organization’s privacy practices and ensure that the GDPR guidelines are met, all while enhancing products and services that will ultimately benefit all of its clients. To undertake this task, the Global Chief Data Office (GCDO) created a global program, among numerous work streams, to address the GDPR requirement and more comprehensively understand the type of personal data IBM controls.

The results of this effort were collected in a central data privacy catalog as a key first step in the journey to readiness, but it was still uncertain how to identify, evaluate and share the discovered information about data that needed to be in compliance with the GDPR. As a result, IBM used its own cataloging technologies and created a central store for its privacy data. To complement the catalog, IBM Data Risk Manager was also implemented to provide a data risk control center where executives and their teams can easily view the updated information from the privacy catalog in a central dashboard and ensure that ongoing requirements of data privacy regulations are met.
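To make the idea of automated classification and data protection rules in this use case more concrete, here is a minimal Python sketch. It assumes a table is represented as a dictionary of columns, uses two deliberately simple regular expressions as stand-in classifiers, and masks any column classified as sensitive for users without privileged access. The pattern set, function names and masking rule are illustrative assumptions, not a description of any product’s behavior.

```python
import re

# Illustrative patterns only; production classifiers are far more thorough.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(values):
    """Return the first label whose pattern matches every non-empty value,
    or None when no pattern fits the whole column."""
    sample = [v for v in values if v]
    if not sample:
        return None
    for label, pattern in PATTERNS.items():
        if all(pattern.match(v) for v in sample):
            return label
    return None


def mask(value):
    """Replace all but the last two characters with asterisks."""
    if not value:
        return value
    return "*" * max(len(value) - 2, 0) + value[-2:]


def apply_policy(table, privileged=False):
    """Return a copy of the table in which sensitive columns are masked
    unless the caller has privileged access."""
    result = {}
    for column, values in table.items():
        if classify_column(values) and not privileged:
            result[column] = [mask(v) for v in values]
        else:
            result[column] = list(values)
    return result


table = {
    "email": ["ana@example.com", "raj@example.org"],
    "city": ["Lisbon", "Pune"],
}
masked = apply_policy(table)
```

In this sketch the `email` column is detected and masked while `city` passes through unchanged, and a privileged caller receives the original values.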
Automate data governance for DataOps
An integrated quality and governance platform helps manage data and protects it from misuse. For effective governance, an enterprise data catalog must be in place. You can’t effectively apply governance if you don’t have organized data with proper metadata tags and lineage. Data organization includes detailing each data object: documenting data properties, ownership, business context, origin, and structure; evaluating data quality; and properly classifying data so it can automatically be used to define and refine an organization’s DataOps practice.

How Integra LifeSciences adopted an integrated approach to manage all parts of their business
When implementing various new systems and processes into their organization, Integra LifeSciences, a surgical and medical instrument manufacturing company, found that governance in their organization was not a simple feat. The quantity of data they needed to keep track of was quickly multiplying, and they were losing track of where the data was located and how they could effectively use it to benefit their business. By turning to an integrated approach that collected, defined and managed their data all in one platform, Integra was able to cut 50% of business systems, reduce their complex management of systems and data, and cut operational costs in order to maximize the organization’s growth benefits.

Integra LifeSciences worked with IBM to implement IBM data cataloging technology that creates consistent definitions of its business data and helps them better understand what their data could do for them.

To learn more about what IBM Watson Knowledge Catalog can do for your business, take a guided tour to see how business users can quickly discover, curate, categorize and share data assets across a whole organization.
Support a governed data lake
Data lake governance takes discipline, good policy and collaboration between the people who manage data access and the people who access the data. Cataloging helps to tag the data in the data lake and create an inventory of information assets. The catalog interface provides data lake users with information about the data within it, including its classification, lineage and how it’s governed. The catalog can serve multiple stakeholders in the organization, eliminating inefficiencies associated with “lost in translation” issues.

Enable AI governance
A data catalog can help the enterprise governance program grow to support the maturing demands of AI governance. As AI takes root, you’ll need an organizational approach toward developing policies that lets you create a framework to effectively design, deploy, and monitor AI-powered models and algorithms with a focus on fairness, accountability, transparency, safety, and privacy, ensuring fair outcomes.
The modern data catalog goes well beyond the legacy metadata repository that businesses have been using for decades. It surpasses the concept of metadata capture and management by including automation and discovery techniques such as visual recognition, natural language classification and machine learning. With these capabilities, a data catalog can organize data in near real time with the added benefit of eliminating the inefficient manual processes required by older repositories.

The new wave of intelligent data catalogs is changing not only the way business is run via virtualization and multicloud deployment, but also how organizations are carving new business models and preparing for the future of AI.

Therefore, as businesses continue to digitally transform themselves to build and incorporate AI into their overall business strategies, the value of data catalogs integrated with a data quality and governance platform becomes more essential.

IBM Watson Knowledge Catalog is an open and intelligent data catalog for managing enterprise data and AI model governance, quality and collaboration. By providing an end-to-end experience rooted in metadata and active policy management, it helps data citizens quickly discover, curate, categorize, and share data assets, data sets, analytical models, and their relationships with other members of your organization.

There’s a reason our customers named Watson Knowledge Catalog a 2020 Gartner Customer Choice Award Winner. Test drive the product to see why.

Talk to an expert to learn more about Watson Knowledge Catalog and explore its seamless integration with IBM DataOps services for IBM Cloud Pak® for Data.
© Copyright IBM Corporation 2020

IBM Corporation
Route 100
Somers, NY 10589

Produced in the United States of America
July 2020

IBM, the IBM logo, ibm.com, IBM Cloud Pak and Watson are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.

The performance data and client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions.

[1] Holger Hürtgen and Niko Mohr. “Achieving business impact with data”, McKinsey & Company, April 2018.
[2] “AI Augmentation Will Create $2.9 Trillion of Business Value in 2021”, Gartner, August 2019.
[3] Nicholas Schmidt. “Top 5 Operational Impacts of CCPA: Part 5 - Penalties and enforcement mechanisms”, International Association of Privacy Professionals (IAPP), August 2018.
[4] “IBM Pathways for GDPR readiness”, IBM White Paper, September 2017.

EWDPJZDQ