How do Data Catalogs Built as Knowledge Graphs Enable an Enterprise Data Fabric?
What is Captured in a Data Catalog?

• Data Dictionary: data elements, datatypes, min/max length, primary and foreign keys, generated/nullable
• Data Samples
• Traceability: mapping to business terms, requirements, organizations, processes
• Data Quality: quality level, number of issues, quality over time
• Data Lineage: how data was derived
• Compliance: relevant regulations
• “Identifiers”: IDs, acronyms, alternative terms
• Documentation: description, see-also, related
• Stewardship: who is responsible, who publishes, status, when updated
• Categories/Aspects/Topics
  • Temporal: what dates are in scope
  • Geographical: what locations are in scope
  • Social: who uses, endorsements
  • Any other categories
• Data Access: access point, access method, access control, format, license

Why Do We Need Data Catalogs?

To find relevant data and to interpret, aggregate, integrate and translate data

A one-stop-shop for all data stakeholders
• To help them find, understand and use available data
• Supporting movement towards data democratization

Answering questions like these is key to creating an enterprise data fabric
• What data do we have?
• Where is it?
• How can it be accessed?
• Who provides it?
• What is its quality and reliability?
What is Data Fabric?

• Data fabric is an architectural approach designed to help organizations better deal with fast growing volumes of data, ever-changing application requirements and distributed processing scenarios.
• Data fabric is not a single technology or even a simple collection of tools;
• It is a design concept that requires multiple existing and emerging data management technologies
• If you have not heard the term “data fabric” yet, you will. It is rapidly growing in popularity. Some analysts estimate the data fabric market will reach $4.5B by 2026.
• You can think of data fabric as a web connecting multiple locations, types, and sources of data, both on-premises and in the public cloud. This network of connections includes information on a variety of methods for accessing data in order to process, move and manage it.
• A data catalog implemented as a Knowledge Graph is a key component of the enterprise Data Fabric, containing and connecting rich metadata about data sources available within the fabric.
What are Knowledge Graphs?

• A Knowledge Graph represents a knowledge domain
• It represents knowledge as a graph
  • A network of nodes and links
  • Not tables of rows and columns
• It represents facts (data) and models (metadata) in the same way
  • Flexible and extensible
  • Rich rules and inferencing
• It is based on open standards, from top to bottom
  • Readily connects to knowledge in private and public clouds

There can be different types and instances of Knowledge Graphs …

FACTS: James has blue eyes. James’ father is Andrew. James is a person.
MODELS: A person has eye color. A person has two parents. A person’s father is also a person and he is male.
RULES: If both of a person’s parents have blue eyes, they will also have blue eyes
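The facts, models and rules in the eye-color example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are plain tuples, the predicate names (`hasFather`, `hasMother`, `hasEyeColor`, `range`) are made up for this sketch, and the mother "Mary" and the parents' eye colors are hypothetical additions so that the rule has something to fire on. A real knowledge graph would use a standards-based store and reasoner.

```python
# Facts, models and rules held as plain (subject, predicate, object) triples.
# All identifiers are illustrative; they mirror the eye-color example.

FACTS = {
    ("James", "type", "Person"),
    ("James", "hasFather", "Andrew"),
    ("James", "hasMother", "Mary"),      # hypothetical: the slide names only the father
    ("Andrew", "hasEyeColor", "blue"),   # hypothetical
    ("Mary", "hasEyeColor", "blue"),     # hypothetical
}

# Models are triples too, stored the same way as the facts.
MODELS = {
    ("Person", "hasProperty", "hasEyeColor"),
    ("hasFather", "range", "Person"),
    ("hasMother", "range", "Person"),
}


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is present."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def apply_range_model(facts, models):
    """Model-driven inference: the object of a property is an instance of its range."""
    ranges = {s: o for s, p, o in models if p == "range"}
    inferred = {(o, "type", ranges[p]) for s, p, o in facts if p in ranges}
    return facts | inferred


def apply_blue_eyes_rule(facts):
    """Rule: if both parents have blue eyes, the child has blue eyes."""
    inferred = set()
    for child, predicate, father in facts:
        if predicate != "hasFather":
            continue
        for mother in objects(facts, child, "hasMother"):
            both_blue = (
                "blue" in objects(facts, father, "hasEyeColor")
                and "blue" in objects(facts, mother, "hasEyeColor")
            )
            if both_blue:
                inferred.add((child, "hasEyeColor", "blue"))
    return facts | inferred


graph = apply_blue_eyes_rule(apply_range_model(FACTS, MODELS))
```

Here James' eye color is deliberately left out of the facts so that the rule derives it, and the model statement about `hasFather` lets the sketch conclude that Andrew is a person without that being stated.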
Data Fabric Architecture
[Figure: Data Fabric Architecture diagram. Labels recoverable from the diagram: Data Sources (center), Data Catalog as a Connected Knowledge Graph, Enrichment and Connection, Data Delivery, Data Governance and Standards, Data Consumers.]
Data Fabric Architecture
While the data sources are shown in the middle of the
Data Fabric Architecture diagram, they are distributed
across different environments. The sources can have many
formats. They can even be external to an enterprise.

The Data Catalog layer represents the evolution of metadata management. Using the semantic formalisms of rich ontologies and controlled vocabularies, it can connect any source of data and capture a variety of rich metadata to facilitate shared understanding across data assets.

The Enrichment Layer: Rich semantics and inherent flexibility of the Data Catalog implemented as a Knowledge Graph make it possible to take advantage of different types of AI approaches and algorithms to further enrich its content. The human users of the catalog participate in the enrichment assisted by the AI. Gartner calls the output of the continuous analysis, knowledge discovery and augmentation of the information active metadata.

The Data Delivery layer of the architecture offers open APIs that can be leveraged by all stakeholders and systems in the enterprise.

The Governance Layer: This architecture requires establishment and broader use of data governance and standards. Standards and best practices must support layering of metadata to accommodate different contexts, stakeholders and use cases. Ontologies to assure data quality and compatibility of metadata are essential to joining information across the enterprise into a common model.
TopBraid EDG™ is an Enterprise Knowledge Graph Infrastructure for Data Governance

[Figure: governance diagram. Labels recoverable from the diagram: Executive (Focus on Management), Representative (Focus on "Acurity", as printed), Applied (Focus on Insights); partly legible ring labels include Process Centric, Metadata, Value and Project.]

TopBraid EDG delivers comprehensive, connected data governance.
• Executive Governance: Putting control processes and policies in place.
• Representative Governance: Focused on having models of the information you will be capturing.
• Applied Governance: Using the information you have captured to address specific needs.

TopBraid EDG is a rich set of interconnected knowledge graphs expressing knowledge about how data is used and managed in the enterprise ecosystem. Using Knowledge Graphs, TopBraid EDG supports data catalogs and semantic modeling, both critical components of the data fabric.

• With EDG, you can choose data governance packages to support a comprehensive and staged approach to data governance.
• These integrated knowledge graphs are ready to be elaborated with your enterprise specific knowledge.
• After this enhancement, your enterprise is ready for implementing comprehensive data governance.
• TopBraid EDG intelligently connects all of your information assets!

Learn more at: topquadrant.com
How do Data Catalogs Built As Knowledge Graphs Enable an Enterprise Data Fabric?

Semantic Modeling and Graphs are Critical Components of the Data Fabric Design

“Semantic modeling via graph technology and metadata-driven augmented data integration are two emerging, yet critical, technology components of the data fabric design because they support the connections among the other components and enable the initial design to evolve over time.”
Source: Gartner, “Emerging Technologies: Data Fabric Is the Future of Data Management,” December 2020

The use of knowledge graph technology makes it possible for a data catalog to support the data fabric.
Data Asset Information in a Catalog is a Graph of Connections

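[Figure: graph centered on Dataset. Relationships shown in the diagram:]
• Dataset, derived from: Dataset
• Dataset, publisher: Party
• Dataset, frequency: Periodicity
• Data Elements, element of: Dataset
• Dataset, part of: Dataset Group
• Dataset, licensed under: License Type
• Dataset, format: Format
• Dataset, topic: Concept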
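One way to see why catalog information is naturally a graph is to store the relationships from the diagram as triples and look up everything connected to a dataset, in both directions. The sketch below assumes plain Python tuples and dictionaries. The dataset names and values are hypothetical, and only the edge labels come from the diagram.

```python
from collections import defaultdict

# Hypothetical catalog entries; the edge labels follow the diagram.
TRIPLES = [
    ("sales_2023", "publisher", "Finance Dept"),
    ("sales_2023", "frequency", "Monthly"),
    ("sales_2023", "format", "CSV"),
    ("sales_2023", "topic", "Revenue"),
    ("sales_2023", "licensed under", "Internal Use"),
    ("sales_2023", "part of", "Sales Datasets"),
    ("sales_summary", "derived from", "sales_2023"),
    ("order_total", "element of", "sales_2023"),
]


def build_index(triples):
    """Index triples by subject (outgoing edges) and by object (incoming edges)."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for subject, predicate, obj in triples:
        outgoing[subject].append((predicate, obj))
        incoming[obj].append((predicate, subject))
    return outgoing, incoming


def describe(node, outgoing, incoming):
    """Collect every connection of a node, whichever direction the edge points."""
    return {
        "outgoing": sorted(outgoing.get(node, [])),
        "incoming": sorted(incoming.get(node, [])),
    }


outgoing, incoming = build_index(TRIPLES)
description = describe("sales_2023", outgoing, incoming)
```

Because new edge labels need no schema change, adding another kind of metadata is just another triple in the list.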
Well Designed Controlled Vocabularies Are Essential for Effective Data Catalogs

Role of Specialized Graphs

Subgraphs within the overall knowledge graph can serve different purposes. Some graphs contain models of the domain of knowledge while others contain actual data. If you are familiar with relational databases, you can think of it as schema versus data.

Just as with relational databases, some data in a knowledge graph is about core objects of interest while other data is about auxiliary objects that are used to describe core objects.

In the case of data catalogs, core objects are data assets and closely related assets such as technology and enterprise. Auxiliary data are controlled vocabularies that are used to describe the core objects. These vocabularies may be structured hierarchically as taxonomies.

Specialized graphs that are separate yet connected: this is how a knowledge graph can support capturing different types of complementary information. Such information can include ontologies and taxonomies, reference data, operational data, and anything in between.

Controlled vocabularies used in cataloging range from:

Relatively small and focused
• Data source formats
• Security classifications
• License types

Large and often taxonomic in nature
• Topics
• Categories
• Geo Locations

[Figure: the Dataset graph from the previous slide, extended to show Excel, CSV and JSON each linked by "type" to Format.]
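The two kinds of controlled vocabulary can be sketched as follows: a small flat list (formats) used to validate catalog entries, and a taxonomy (topics) in which each term points to a broader term. This is a minimal illustration, not a vocabulary standard. The format values come from the slide, while the topic names and the `BROADER` structure are hypothetical.

```python
# Small, focused vocabulary: allowed data source formats (values from the slide).
FORMATS = {"Excel", "CSV", "JSON"}

# Large, taxonomic vocabulary: each topic maps to its broader topic.
# The topic names are hypothetical.
BROADER = {
    "Retail Sales": "Sales",
    "Online Sales": "Sales",
    "Sales": "Finance",
    "Finance": None,
}


def ancestors(topic):
    """Return the chain of broader topics, nearest first."""
    chain = []
    current = BROADER[topic]
    while current is not None:
        chain.append(current)
        current = BROADER[current]
    return chain


def validate(dataset):
    """List every value that is not drawn from a controlled vocabulary."""
    problems = []
    if dataset["format"] not in FORMATS:
        problems.append(f"unknown format: {dataset['format']}")
    if dataset["topic"] not in BROADER:
        problems.append(f"unknown topic: {dataset['topic']}")
    return problems


def datasets_under(topic, datasets):
    """Find datasets tagged with a topic or any narrower topic (validated input only)."""
    return [
        d["name"]
        for d in datasets
        if d["topic"] == topic or topic in ancestors(d["topic"])
    ]


catalog = [
    {"name": "web_orders", "format": "JSON", "topic": "Online Sales"},
    {"name": "store_receipts", "format": "CSV", "topic": "Retail Sales"},
]
```

The taxonomy is what makes a search for "Finance" return datasets that were only ever tagged "Online Sales" or "Retail Sales".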
API Requirement for Data Catalogs

Openness of metadata is increasingly becoming mandatory. For a catalog to be a key enabler of data fabric, it must provide open APIs.

What makes an API “open?”
An open API is:
• Standards based
• Available through a web protocol
• Grows with the catalog to accommodate new assets and new metadata
• Provides access to ALL information: not to just the “pre-built” metadata, but to any metadata that is added to the catalog.

Example: Open APIs in TopBraid EDG
Standards-based:
• SPARQL and GraphQL
For Data:
• Get me information about a dataset
And Models
• What type of information can I get about a dataset?
• What can I conclude from this information?
Read and Write
• For data and for models

Webinar: Semantic Knowledge Graphs are the Governance Architecture of the Future
Blog: Querying TopBraid EDG with GraphQL
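To make "standards based" and "available through a web protocol" concrete, the sketch below builds a request that follows the W3C SPARQL 1.1 Protocol (query via URL-encoded POST) using only the Python standard library. Several things here are assumptions: the endpoint URL is a placeholder, the query assumes datasets are described with the DCAT and Dublin Core vocabularies, and nothing here is taken from TopBraid EDG's own API documentation.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint; substitute the SPARQL endpoint your catalog exposes.
ENDPOINT = "https://catalog.example.com/sparql"

# Assumes datasets are described with DCAT and Dublin Core terms.
QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?dataset ?title ?publisher WHERE {
  ?dataset a dcat:Dataset ;
           dct:title ?title .
  OPTIONAL { ?dataset dct:publisher ?publisher }
}
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol request (query via URL-encoded POST)."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
    )


request = build_sparql_request(ENDPOINT, QUERY)

# Sending is left out because the endpoint is a placeholder:
#   import json
#   with urllib.request.urlopen(request) as response:
#       results = json.load(response)
```

Because the protocol and the result format are standardized, the same client code should work against any conforming endpoint, which is the practical payoff of an open API.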
Creating and Using a Data Catalog

Steps for Creating a Data Catalog
1) Identify the initial Data Sources to be Inventoried
2) Decide What Metadata is Important
3) Establish Controlled Vocabularies for Metadata Items
4) Decide on the cataloging processes
5) Automate capture of technical metadata
6) Define roles, responsibilities and processes for enriching the catalog with business and operational metadata
7) Test with users, iterate and extend

Data Catalogs and Business Glossaries

Data catalogs:
• Simply document data assets
• Can’t provide any business or operational context
• Technical metadata only

Business glossaries:
• Provide semantics
• Identify business context
• Contain shared, technology independent meaning of data
• A way to “ground” data from diverse data sources using common business terms
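Step 5, automating the capture of technical metadata, can be illustrated with a small harvester that reads a database's own schema and emits data dictionary records. The sketch uses Python's built-in `sqlite3` module and SQLite's `PRAGMA table_info`. The `customer` table is a hypothetical source created in memory, and the record field names are chosen for this example only.

```python
import sqlite3


def capture_data_dictionary(connection):
    """Return one metadata record per column for every table in the database."""
    records = []
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields: cid, name, type, notnull, dflt_value, pk
        columns = connection.execute(f"PRAGMA table_info('{table}')")
        for _cid, name, datatype, notnull, _default, pk in columns:
            records.append(
                {
                    "table": table,
                    "element": name,
                    "datatype": datatype,
                    "nullable": not notnull,
                    "primary_key": pk > 0,
                }
            )
    return records


# Hypothetical source database, created in memory for the demonstration.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)"
)
dictionary = capture_data_dictionary(connection)
```

Records like these cover only technical metadata. Steps 6 and 7 add the business and operational context that the schema itself cannot supply.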
How to Connect Data Assets and Business Terms?

Manually
• Business stewards, data stewards
• A very large task

Automatically
• Using column/data element names
• Using data statistics
• Using data samples
• “Human in the loop” curation

Knowledge Engines Can Help Automate Connecting Data Assets

Using Machine-Processable Standards for:
• Providing a common language of meaning
• Describing technology assets, information resources, policies, processes and stakeholders
• Capturing what AI algorithms can be used and for what purposes

[Figure: example of a business term, "Age Group".]
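Matching on column names with a human in the loop might look like the sketch below. It is a deliberately simple stand-in for what a knowledge engine would do: it compares normalized names with `difflib.SequenceMatcher` from the Python standard library and routes middling scores to a steward for review. The glossary terms (other than "Age Group"), the thresholds and the status labels are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical glossary; only "Age Group" appears in the slide.
BUSINESS_TERMS = ["Customer Name", "Date of Birth", "Age Group", "Postal Code"]


def normalize(name):
    """Split camelCase and snake_case names into lower-case words."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return re.sub(r"[^a-z0-9]+", " ", spaced.lower()).strip()


def suggest_term(column, terms, accept=0.85, review=0.6):
    """Suggest a business term for a column and say how much to trust it.

    Scores at or above `accept` are linked automatically, scores at or above
    `review` go to a steward, and anything lower is left unmatched.
    Both thresholds are illustrative defaults, not recommendations.
    """
    scored = [
        (SequenceMatcher(None, normalize(column), normalize(term)).ratio(), term)
        for term in terms
    ]
    score, term = max(scored)
    if score >= accept:
        status = "auto-linked"
    elif score >= review:
        status = "needs review"
    else:
        status, term = "unmatched", None
    return term, status, round(score, 2)


exact = suggest_term("AGE_GROUP", BUSINESS_TERMS)
camel = suggest_term("customerName", BUSINESS_TERMS)
noise = suggest_term("xyz123", BUSINESS_TERMS)
```

Name matching alone is weak evidence, which is why the slide also lists data statistics and data samples as signals. The review band is where curation by business and data stewards fits in.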
Information Lineage — A Screenshot of one of the Applications Provided By TopBraid EDG
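Lineage questions reduce to walking "derived from" edges in the graph. The sketch below is a generic traversal in plain Python, not a description of how the TopBraid EDG lineage application works, and the dataset names are hypothetical.

```python
# Hypothetical "derived from" edges: each dataset maps to its direct sources.
DERIVED_FROM = {
    "quarterly_report": ["sales_summary", "fx_rates"],
    "sales_summary": ["sales_raw"],
    "sales_raw": [],
    "fx_rates": [],
}


def upstream(dataset, derived_from):
    """Return every direct and indirect source of a dataset."""
    seen = []
    stack = list(derived_from.get(dataset, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.append(current)
        stack.extend(derived_from.get(current, []))
    return seen


def downstream(dataset, derived_from):
    """Impact analysis: which datasets depend on this one, directly or not."""
    return sorted(
        candidate
        for candidate in derived_from
        if dataset in upstream(candidate, derived_from)
    )


sources = upstream("quarterly_report", DERIVED_FROM)
impacted = downstream("sales_raw", DERIVED_FROM)
```

The same edges answer both "where did this come from?" and "what breaks if this changes?", depending only on the direction of the walk.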

Key Take Away Points

• Data Fabric is a new architectural approach powered by Data Catalogs built as Knowledge Graphs.
• The use of the Knowledge Graph technology makes Data Catalogs powerful, flexible and semantically rich.
• Openness of metadata has become mandatory. Data Catalogs must be standards-based and open.
• Capabilities and benefits of Data Catalogs built as Knowledge Graphs are:
  • Flexibility: can express any data and metadata
  • Composability: ease of incremental evolution
  • Connectivity: bridge all data and metadata “silos” for seamless data governance
  • Future proof: based on standards
  • Expressivity: the most comprehensive “out of the box” models
  • Integrability: the most complete, open and flexible APIs
  • Smart: integrate reasoning and machine learning

Contact us at info@topquadrant.com to learn more about how you can implement a data catalog as a knowledge graph to support your enterprise data fabric.
Established in 2001, TopQuadrant offers the most complete and mature Information Governance solution, based on Knowledge Graphs.

• 200+ customers
• 70+ TopBraid EDG customers
• 40% = Fortune 500
• 60% = public listed

TOPBRAID EDG
• Released in 2016
• Multiple packages
• Available on premise and as SaaS

TOPBRAID COMPOSER
• Released in 2006

THOUGHT LEADERSHIP
• Strong commitment to standards-based approaches to data semantics