
1. Metadata Management
1.1 Metadata Management Life Cycle
The Metadata Management Life Cycle defines the phases of the end-to-end metadata management
process, from planning through maintenance to retirement of metadata.

1.1.1 Governance and Planning


Governance and Planning involves initial planning, defining the objectives for the metadata management
process, and identifying owners and the associated roles and responsibilities for each of the stakeholders.
The Data Warehouse environment is used to ingest and explore any data, including structured,
semi-structured and unstructured data. Given this usage, it is challenging to enforce strict control and
governance on the data being ingested into the Data Warehouse environments, and hence Governance of
Metadata is of relatively lesser significance in this context.

1.1.2 Metadata Content


Metadata content defines the types of metadata that need to be captured as part of the metadata
management process.
Type of Metadata | Definition / Description
Business Metadata | Business Metadata defines the data in the Warehouse in user friendly terms. Business Metadata captures what data is stored in the Warehouse, where the data is sourced from, how the data is used and its relationship to other data in the Warehouse.
Technical Metadata | Technical Metadata defines the data, objects and processes in the Warehouse from a technical point of view. Technical Metadata captures system metadata such as tables, data elements, indices, partitions in a relational database, files stored in the cluster, security classification for the data elements etc.
Operational Metadata | Operational Metadata (sometimes also referred to as Process Metadata) is the data about the processes in the Warehouse. Operational Metadata captures process schedules, frequency of batch processes, status summary and usage statistics for various processes etc.
Business Rules & Transformation Rules | Business Rules and Transformation Rules related metadata capture the rules applied on data elements during the data acquisition, data ingestion or data extraction and loading processes in the Data Warehouse. In some cases, this metadata can also be used to dynamically process and load the source data feeds into the Data Warehouse.
System Statistics | System Statistics related metadata captures data related to system resource utilization for proactive monitoring and maintenance within a Data Warehouse environment.
Metadata for Downstream Process | Metadata for downstream processes captures the Technical Metadata including mapping of data elements from the Warehouse to downstream processes or applications such as BI tools, analytical models or any other downstream applications.

1.1.3 Metadata Capture Strategy


Metadata capture strategy defines the process and / or tools that need to be used for capturing the
required metadata. Strategy for metadata capture can include multiple tools / approaches based on
the type of data and feasibility constraints. The strategy outlines the guidelines for using an
appropriate tool or mechanism for identified use cases.

1.1.4 Metadata Model and Integration


Metadata Modelling defines the data modelling strategy for the metadata repository. Metadata
Integration defines the approach for integration of various types of metadata including integration
from various metadata repositories, if applicable.

1.1.5 Metadata Visibility


Metadata Visibility defines the processes associated with enabling access to the metadata elements,
types of analyses and use-cases for usage of metadata by end-users.

1.1.6 Metadata Standards and Quality


Metadata Standards and Quality are of relatively lesser significance compared to the other phases in
the context of Data Warehouse. Metadata is created once and is occasionally used by a limited set of
users. Hence typically Organizations do not invest in tracking or enhancing the quality of metadata
captured either through an automated process or through a manual process.

1.1.7 Maintenance and Retirement


Maintenance and Retirement defines the following aspects associated with metadata management processes:
• Purging and archival of obsolete metadata (Operational Metadata, for example)
• Restructuring and enhancements to the Metadata Model
• Processes and Governance for ensuring accuracy and timeliness of the metadata captured with on-going
changes and project releases

1.2 Metadata Content


This section details the recommended metadata data elements that need to be captured for the various types of
Metadata as part of the Metadata Management strategy for the environment.

1.2.1 Business Metadata


Following are the recommended data elements that need to be captured for the Business Metadata. The
Conceptual Model and Logical Model information is also stored in the Business Metadata for ease of use
and to support impact analysis for any business changes.

Metadata Data Elements | Level
Source Feed Business Name | Source Feed
Source Feed Business Description | Source Feed
Source Feed Usage | Source Feed
Source Feed Group Name | Source Feed
External Data Source Indicator | Source Feed
Source Host Code Name | Source Feed
Source Feed Business Owner / Contact | Source Feed
Source Feed Technical Contact | Source Feed
Source Column Business Name | Source Column
Source Column Business Description | Source Column
Target File Business Name | Target File
Target File Business Description | Target File
Target File Usage | Target File
Subject Area | Target File
Data Security Classification | Target File
Target Column Business Name | Target Column
Target Column Business Description | Target Column
Target Column Synonym(s) | Target Column

1.3 Technical Metadata


Following are the recommended Technical Metadata data elements that need to be captured for the ODS, Data
Warehouse, Data Marts and Source Systems. These should be captured for all sources, targets and extracts provided.
Level | Metadata Data Elements
Source Feed | Source Feed Name
Source Feed | Source Database Name
Source Feed | Source Table Technical Name
Source Feed | Source Data File Name
Source Feed | Source Feed Group Name
Source Feed | Source Host Type
Source Feed | Source System Code Name
Source Feed | Source Feed Format Type
Source Feed | Source File Layout Definition (XSD / JSON etc.)
Source Feed | Source Trigger File Name
Source Feed | Source Trigger File Type and Format
Source Feed | Source Encryption Method
Source Feed | Source Feed Profile Path
Source Feed | Source Feed Delivery Frequency
Source Feed | Exception Days for the Source Feed
Source Feed | Expected Delivery Time of the Source Feed
Source Feed | Expected Number of Records
Source Feed | Number of Columns (Source Feed)
Source Column | Source Column Technical Name
Source Column | Source Column Data Format
Source Column | Source Column Data Type
Source Column | Source Column Data Length
Source Column | Required / Optional (NULL) Indicator
Target File | Target File Name
Target File | Target File Format Type
Target File | Target File Layout Definition (XSD / JSON etc.)
Target File | HDFS Location (Directory Path)
Target File | Target Data Security (ARD Role)
Data Source | Ingestion Method / Extraction Method
Target File | Archive Location
Target File | Target Encryption Method
Target Object | Target Resource Size
Target File / Table | Update Frequency
Target File / Table | Update Type
Target Column | Target Column Technical Name
Target Column | Target Column Data Format
Target Column | Target Column Data Type
Target Column | Target Column Data Length
Target Column | Expression / Transformation (Source Target)
Column | Column Delimiter Used
Column | System of Record / System of Reference

1.3.1 Operational Metadata


Following are the data elements recommended to be captured as part of the Operational Metadata. The
Operational Metadata captured does not vary based on the source system or the type of the source data.

Operational Metadata data elements can be classified into 2 broad categories, Data Movement and Data Usage,
for each of the source data types.
Following are the recommended Operational Metadata data elements that need to be captured:
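Source data types: Structured, Unstructured
Category | Metadata Data Elements
Data Movement Metadata | Source Feed Delivery Time SLA
Data Movement Metadata | Source Feed Delivery Time (Actual)
Data Movement Metadata | Source Feed Exception Indicator
Data Movement Metadata | Source Feed Exception Details
Data Movement Metadata | Number of Records Received
Data Movement Metadata | Expected Number of Columns
Data Movement Metadata | Actual Number of Columns Received
Data Movement Metadata | Data Load Rule Name
Data Movement Metadata | Data Load Rule Threshold Type
Data Movement Metadata | Data Load Rule Failure Value
Data Movement Metadata | Data Load Rule Last Failure Date and Time
Data Movement Metadata | Business Date
Data Movement Metadata | Last Data Load Date and Time
Data Movement Metadata | Data As of Date
Data Movement Metadata | Job Name
Data Movement Metadata | Job Description
Data Movement Metadata | Job Location
Data Movement Metadata | Job Type (Batch / Real-Time etc.)
Data Movement Metadata | Job Execution Frequency
Data Movement Metadata | Job Execution Start Time
Data Movement Metadata | Job Execution End Time
Data Movement Metadata | Job Status
Data Movement Metadata | Job Completion Time SLA
Data Movement Metadata | Job Execution Exception Indicator
Data Movement Metadata | Job Execution Exception Type
Data Movement Metadata | Job Execution Exception Details
Data Movement Metadata | Number of Success Records
Data Movement Metadata | Number of Exception Records
Data Movement Metadata | Number of Rejected Records
Data Usage Metadata | Access Count
Data Usage Metadata | Last Access Date and Time
Data Usage Metadata | Last Access User / Process
Data Usage Metadata | Number of Queries / Extractions
Data Usage Metadata | Last Extraction Date and Time
Data Usage Metadata | Output Protocol (FTP, Tumbleweed etc.)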

1.3.2 Business Rules and Transformation Rules


Following are the recommended Business Rules and Transformation Rules related Metadata data elements that
need to be captured:
Rule levels: File Level, Column Level
Metadata Data Elements:
• Rule Name
• Rule Type
• Rule Level Name
• Rule Threshold Type
• Alert Threshold Value
• Abort Threshold Value
• Rule Default Value
• Trigger Field Name
• Rule Filter Condition
• Rule Parameter Name
• Rule Parameter Value
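
The threshold-related elements above (Rule Threshold Type, Alert Threshold Value, Abort Threshold Value) suggest a simple evaluation step during a data load. The following is a minimal sketch only: the names `LoadRule` and `evaluate_rule` are hypothetical, and the assumed semantics (a "PERCENT" threshold compares the share of failing records, a "COUNT" threshold compares the absolute number) are not defined in this document.

```python
from dataclasses import dataclass


@dataclass
class LoadRule:
    """Hypothetical container for a subset of the rule metadata elements."""
    rule_name: str
    rule_level_name: str        # "FILE" or "COLUMN" (assumed values)
    rule_threshold_type: str    # "COUNT" or "PERCENT" (assumed values)
    alert_threshold_value: float
    abort_threshold_value: float


def evaluate_rule(rule: LoadRule, failed_records: int, total_records: int) -> str:
    """Return "PASS", "ALERT" or "ABORT" for one rule on one load."""
    if rule.rule_threshold_type == "PERCENT":
        # Guard against empty feeds to avoid division by zero.
        observed = 100.0 * failed_records / total_records if total_records else 0.0
    elif rule.rule_threshold_type == "COUNT":
        observed = float(failed_records)
    else:
        raise ValueError(f"unknown threshold type: {rule.rule_threshold_type}")

    # Abort is checked first because it is the more severe outcome.
    if observed >= rule.abort_threshold_value:
        return "ABORT"
    if observed >= rule.alert_threshold_value:
        return "ALERT"
    return "PASS"


rule = LoadRule("null_check_customer_id", "COLUMN", "PERCENT", 1.0, 5.0)
print(evaluate_rule(rule, failed_records=30, total_records=1000))  # 3% of records failed
```

Keeping the thresholds in metadata rather than in job code is what would allow a load process to be driven dynamically from the repository, as described for Business Rules and Transformation Rules.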

1.3.3 System Statistics


Following are the recommended System Statistics that need to be captured. The metadata data elements listed
are high-level statistics, each of which can comprise one or more detailed statistics. The detailed list of system
statistics that can be captured depends on the Operating System, the monitoring tools used etc. The table below
provides examples of detailed statistics for each category.
Metadata Data Elements | Examples
CPU Utilization | CPU Utilization of System Processes, CPU Utilization of Applications / Users, CPU Idle Time etc.
Memory Utilization | Total Physical Memory, Memory used for Swap, Memory Used for Caching etc.
Storage Utilization | Total Space Available, Utilized Space
I/O Utilization | Number of Transfers per Second, Data Reads (kB/s), Data Writes (kB/s), I/O Wait Time, Reads per Second, Writes per Second etc.

1.4 Metadata Capture Strategy


In the context of the Data Warehouse, Metadata is captured only in the production environment.
The approach or strategy for capturing the Metadata for the Warehouse can be broadly classified into the
following categories:
• Metadata capture for structured data
• Metadata capture for semi-structured / unstructured data sources
• Metadata capture for downstream processes from the Warehouse
The following table summarizes the metadata capture strategy by type of Metadata.
Metadata Type | Options
Business Metadata | Sourced from Commercial BI Metadata Repository; Manual Capture
Technical Metadata | Sourced from Commercial BI Metadata Repository; Auto-Capture (from system tables / repositories); Manual Capture
Operational Metadata | Published to Metadata Repository; Auto-Capture (from Application Repositories)
Business Rules & Transformation Rules | Custom Manual Capture (through the portal)
System Statistics | Auto-Capture
Metadata for Downstream Processes | Manual Capture

1.4.1 Business Metadata


Business metadata provides the data definition for each of the data elements processed and loaded into the
Warehouse. The metadata management process should provide a mechanism for manual capture of Business
Metadata during the design phase.
Following are the general guidelines for capturing the Business Metadata:
• For structured data sourced
o If the Business Metadata is available within the Source Metadata Repository, the required data
elements should be sourced and loaded into the Data Warehouse Metadata Repository
o If the Business Metadata is not available within the Source Metadata Repository, the data owner
responsible for the movement of the data from Source to Data Warehouse should provide the
business metadata. The metadata can be captured manually using a customized template used
for the Metadata Management process.
▪ Data Stewards or Analysts responsible for capturing (creating) the business metadata
should be able to upload the metadata through a self-service portal. This would enable
authentication and authorization for the users capturing or creating the metadata.
▪ Alternatively, Data Stewards or Analysts can be provided with a UI on the portal for
creating the business metadata that cannot be sourced programmatically.
• For any other source data feeds and target objects (in all cases), business metadata should be captured
using the manual capture process. When the data is captured through the manual process
o Metadata should be certified, validated and released
The table below captures the details of metadata capture by layer for Business Metadata.
Layer | When | Metadata Capture Strategy | Responsible Party
Data Access Layer | Design Phase | Manual Capture | Business Analysts
Data Storage Layer | Design Phase | Manual Capture | Business Analysts

1.4.2 Technical Metadata


Technical metadata captures the details of how, what and where the data elements are stored within the Data
Warehouse environments. Given the multitude of options for modelling and storing the various types of data in a
Data Warehouse, the Technical Metadata captured varies based on the type of data being sourced or ingested into
the Data environment.
The table below captures the details of metadata capture by layer for Technical Metadata
Layer

When

Metadata Capture Strategy

Responsible Party

Design Phase

Auto-Capture

Data Stewards

Design Phase

Manual Capture

Data Stewards

Design Phase

Auto-Capture

Data Stewards

Data Access Layer


Data Landing Layer

Data Integration Layer

Design Phase

Manual Capture

Data Stewards

Auto-Capture

Data Stewards

Data Storage Layer

Design /
Development Phase
Design Phase

Manual Capture

Data Stewards
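
The Auto-Capture option for Technical Metadata (from system tables / repositories) can be illustrated with a small sketch. It uses SQLite's catalog only as a stand-in for whatever system tables the actual platform exposes; the function name `capture_column_metadata` and the output key names are hypothetical, chosen to mirror the Target Column elements listed earlier.

```python
import sqlite3


def capture_column_metadata(conn: sqlite3.Connection, table: str) -> list[dict]:
    """Read column-level technical metadata from SQLite's catalog."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [
        {
            "target_column_technical_name": name,
            "target_column_data_type": col_type,
            "required_indicator": bool(notnull),
        }
        for _cid, name, col_type, notnull, _default, _pk in rows
    ]


# Stand-in target table; a real capture job would connect to the warehouse instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER NOT NULL, name VARCHAR(50))")
columns = capture_column_metadata(conn, "customer")
for col in columns:
    print(col)
```

Elements that the catalog cannot supply, such as business names or security classification, would still need the manual capture path.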

1.4.3 Operational Metadata


Operational Metadata captures data from the auditing and logging of data acquisition, data transformation and
loading processes, BI usage data, details of data integration job and report execution times etc.
The approach and guidelines for capturing the Operational Metadata depend on the type of operational data
being captured and can be broadly classified into the following categories:
• Operational Metadata for Data Movement
• Operational Metadata for BI and Analytics
The Metadata Management process implemented should capture the Operational Metadata for data movement
during the actual job execution. The metadata should be captured programmatically without any manual
intervention. Operational Metadata for Data Usage, however, can be extracted on a periodic basis and can be
scheduled.
Metadata Repository
• An Operational Metadata repository should be created for the Data Warehouse
• It is recommended to implement a metadata repository at least for Operational Metadata irrespective of
the Data Modelling strategy adopted
• If an integrated Metadata Repository is implemented, the Operational Metadata can be part of the
repository (subject area approach)
Guidelines
Following are the general guidelines for capturing Operational Metadata for Data Movement:
• A common approach is used for capturing Operational Metadata for structured, semi-structured and
unstructured data
• Metadata capture should be event driven and required data elements should be published into the
metadata repository as soon as the data movement process / cycle completes
• Data Ingestion, Data Extraction and the Data Load processes should have a mechanism to publish the
required data elements into the Operational Metadata repository
o The data elements may either be published using pre and post processing scripts for the batch
processes
o Alternatively, a control script can continuously monitor the batch process and publish the required
data elements into the operational metadata repository
Following are the general guidelines for capturing Operational Metadata for BI and Analytics:
• Operational Metadata for BI and analytics will be primarily sourced from the application repositories
• Metadata capture can be batch oriented, with ability to support intra-day batches
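
The pre and post processing approach for Data Movement described above can be sketched as a wrapper that publishes job-level elements as soon as a job finishes. This is an illustrative sketch only: an in-memory SQLite database stands in for the Operational Metadata repository, and the table `job_run`, the function `run_with_metadata` and the sample jobs are hypothetical names.

```python
import sqlite3
from datetime import datetime, timezone

repo = sqlite3.connect(":memory:")  # stand-in for the Operational Metadata repository
repo.execute(
    "CREATE TABLE job_run (job_name TEXT, start_time TEXT, end_time TEXT, "
    "job_status TEXT, success_records INTEGER, exception_details TEXT)"
)


def run_with_metadata(job_name, job):
    """Run `job` and publish its operational metadata when it completes."""
    start = datetime.now(timezone.utc).isoformat()   # pre-processing step
    status, loaded, details = "SUCCESS", 0, None
    try:
        loaded = job()
    except Exception as exc:                          # record the failure in the repository
        status, details = "FAILED", repr(exc)
    end = datetime.now(timezone.utc).isoformat()     # post-processing step
    repo.execute(
        "INSERT INTO job_run VALUES (?, ?, ?, ?, ?, ?)",
        (job_name, start, end, status, loaded, details),
    )
    repo.commit()
    return status


def load_customers():
    return 1200  # hypothetical job body: returns the number of records loaded


def broken_job():
    raise RuntimeError("source feed missing")


run_with_metadata("load_customers", load_customers)
run_with_metadata("load_orders", broken_job)
for row in repo.execute("SELECT job_name, job_status, success_records FROM job_run"):
    print(row)
```

Because the publish step runs in the same code path as the job, no manual intervention is needed and failed runs are recorded as well as successful ones.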

The table below captures the details of metadata capture by layer for Operational Metadata
Layer | When | Metadata Capture Strategy | Responsible Party
Data Integration Layer | Data Movement | Auto-Capture | (not specified)
Data Storage Layer | Post Go-Live, on regular basis | Auto-Capture | (not specified)

1.4.4 Business Rules & Transformation Rules


Business Rules and Transformation Rules applied to the data sourced into the Data environment are always
captured through a custom manual process. This section provides the general guidelines for capturing the Business
Rules and / or Transformation Rules based on the type of data.
Structured Data
• Business Rules and Transformation Rules should be captured as separate rules
• Applicable Business Rules and Transformation Rules should be captured at both Source Table level and
Source Column level
• Linkage between the Business Rules and Transformation Rules should be established through the source
object
• Multiple rules may be associated with a given Source Table or Source Column
• Rules may either be captured and stored in the metadata repository (database) or maintained as Excel
files associated with the source object
Semi-Structured / Unstructured Data
• Business Rules and Transformation Rules should be captured as separate rules
• Rules should be captured at source feed level
• Multiple rules may be associated with a given source feed
• It is recommended to capture the rules using Excel files associated with the source objects
o Business rules can be optional at field level
o Transformation rules applicable to field level may be captured in the Excel files
Business Rules and Transformation Rules related metadata is dependent on the Technical Metadata for the source
data feeds or source data elements. In order to ensure data quality and accuracy of the metadata, it is
recommended to capture the business rules and transformation rules metadata through a UI on the portal with
the following checks and balances:
• Source data feeds and data elements should be pre-populated from the Technical Metadata available in
the metadata repository
• End-users should not be able to edit or modify the source data elements
• UI can have basic validations to ensure mandatory metadata elements are captured
• UI should also have a provision to allow users to upload a file with the rules either at source data feed
level or at source data element level
• Users should be able to edit, update or delete any rules entered through the UI

The table below captures the details of metadata capture by layer for Business Rules and Transformation Rules
related Metadata
Layer | When | Metadata Capture Strategy | Responsible Party
Data Integration Layer | Design Phase | Manual Capture (Custom Process) | (not specified)

1.4.5 System Statistics


System Statistics for the Warehouse environment should be captured using automated capture from the system
logs or through the use of system monitoring tools and utilities.
Following are the general guidelines for capturing System Statistics:
• System statistics should always be captured using an automated process
• Key utilization statistics such as CPU or memory utilization should be tracked continuously
• Utilization statistics for other resources such as storage may be captured on a periodic basis
The table below captures the details of metadata capture by layer for System Statistics
Layer | When | Metadata Capture Strategy | Responsible Party
Data Landing Layer | Post Go-Live, on regular basis | Auto-capture | System Administrators
Data Integration Layer | Post Go-Live, on regular basis | Auto-capture | System Administrators
Data Storage Layer | Post Go-Live, on regular basis | Auto-capture | System Administrators
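
As an illustration of periodic, automated capture of Storage Utilization, the sketch below takes one snapshot using only the Python standard library. The function name `capture_storage_utilization` and the output key names are hypothetical; a real deployment would more likely rely on the platform's monitoring tools and would schedule the capture.

```python
import shutil
import time


def capture_storage_utilization(path: str = ".") -> dict:
    """Take one storage utilization snapshot for the file system holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple with total, used and free bytes
    return {
        "captured_at": time.time(),
        "path": path,
        "total_space_bytes": usage.total,
        "utilized_space_bytes": usage.used,
        "utilized_pct": round(100.0 * usage.used / usage.total, 2) if usage.total else 0.0,
    }


stat = capture_storage_utilization(".")
print(stat["total_space_bytes"], stat["utilized_space_bytes"], stat["utilized_pct"])
```

CPU and memory statistics, which the guidelines say should be tracked continuously, are platform specific and are left out of this sketch.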

1.4.6 Metadata for Downstream Processes


Metadata for the downstream processes comprises business metadata for the target objects and technical
metadata for the target objects, including the lineage from the Warehouse / Hadoop to the downstream data
repositories (data marts / Hive / HBase etc.), BI tools or analytical models. This metadata is required to enable
complete lineage analysis from the source systems to the target applications.
Following are the general guidelines for capturing the metadata for downstream processes:
• Business Analysts or the data stewards responsible for moving the data from the Data Warehouse to the
downstream applications should be primarily responsible for capturing the Business Metadata elements
• Technical SMEs / technical point-of-contact for the downstream applications should be primarily
responsible for capturing the Technical Metadata including the lineage metadata
• Any business rules and transformation rules applied should be captured at both Entity and Attribute level

The table below captures the details of metadata capture by layer for Metadata for Downstream Processes
Layer | When | Metadata Capture Strategy | Responsible Party
Data Storage Layer | Design Phase | Manual Capture | Business Analysts, Data Analysts, Data Stewards

1.5 Metadata Modeling and Integration


Metadata modelling defines the approach or data modelling strategy for the metadata repository. This section
describes various options for metadata modelling and provides a comparative analysis between each of the
options.

1.5.1 Metadata Refresh


Metadata Refresh defines the process and frequency for capturing and updating the metadata on an on-going
basis. The processes and frequency of Metadata refresh vary based on the type of the Metadata and the
environment for which Metadata is being captured and refreshed.
The table below provides a consolidated view of the Metadata refresh strategy for each of the environments.
Type of Metadata | Description
Business Metadata | Metadata is created; Initial Metadata captured during Design Phase; Metadata needs to be updated continuously whenever there is a change to source data feed or target structures, enforced as part of the code release process
Technical Metadata | Metadata is created; Metadata that needs to be captured manually is created during the Design Phase; Metadata captured using automated process is initially created during the development phase and certified before code release; Metadata needs to be updated continuously whenever there is a change to source data feed or target structures, enforced as part of the code release process
Operational Metadata | Data Movement related Operational Metadata is captured using event driven approach, but on ad-hoc basis; Data Usage related Operational Metadata can be captured on a need basis (Optional)
Business Rules and Transformation Rules | Rules related Metadata should be created; Initial metadata should be created after the Technical Metadata is sourced into the repository; Metadata should be updated on a continuous basis, as and when there is a need for change, using the custom manual approach defined
System Statistics | Captured using automated process on a need basis; Needs to be captured and maintained on a regular basis only if required (for usage based charge-back mechanism, for example)
Metadata for Downstream Processes / Applications | For any downstream applications designed, metadata should be created in environment; Metadata should be captured during the Design phase

1.6 Metadata Visibility


Visibility or access to the Metadata captured for the Data Warehouse should be enabled only through a standard
intranet portal. The portal should provide the following functionalities:
• Provide a layer of abstraction for the metadata capture, integration and storage aspects
• Ability to authenticate users accessing the portal
o It is assumed that there is no need for user authorization (data security)
• Ability to search on the metadata captured, using any of the use-cases identified
o Provide a layer of abstraction between the User Interface and the underlying data elements on
which the search operation is performed. For example, a basic search on the UI for a table name
could perform a search on table technical name, table business name, table business description
and the source data file name.
o Provide ability to perform advanced search using a combination of search criteria. For example,
search for a given table name within a subject area for a given Market.
o Pagination of the search results for better readability
o Ability to sort the search results on predefined criteria including search relevance (this use case
may need further discussion and elaboration)
o Should provide ability to export the search results to Excel for offline analysis
• Ability to establish data lineage for data entities and elements within the Data Warehouse
o Should support bi-directional lineage analysis
o Completeness and quality of data lineage information will be dependent on the accuracy and
completeness of the metadata captured either through automated process or through the
manual capture process
• Ability to generate and view standard operational reports
Following are the general guidelines with respect to the Metadata Visibility:
• End users of metadata (data analysts, for example) should never be provided direct access to the
metadata repository database tables or the Excel files within the Data Warehouse
• Only system administrators and technical SMEs for the Data Warehouse may have direct access to the
metadata repository including the physical storage
• Access to metadata environments should be enabled through separate user interfaces (separate
portals, sub-sites etc.)
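
The basic search behaviour described for the portal (one search term matched against several underlying fields, with paginated results) can be sketched as follows. The field names in `SEARCH_FIELDS`, the function `basic_search` and the sample catalog are hypothetical; sorting here is by technical name only, since relevance ranking is left open in this document.

```python
SEARCH_FIELDS = (
    "table_technical_name",
    "table_business_name",
    "table_business_description",
    "source_data_file_name",
)


def basic_search(records, term, page=1, page_size=10):
    """Match `term` against every search field; return one page and the total hit count."""
    term = term.lower()
    hits = [
        r for r in records
        if any(term in str(r.get(f, "")).lower() for f in SEARCH_FIELDS)
    ]
    hits.sort(key=lambda r: r["table_technical_name"])  # predefined sort criterion
    start = (page - 1) * page_size
    return hits[start:start + page_size], len(hits)


catalog = [
    {"table_technical_name": "CUST_MSTR", "table_business_name": "Customer Master",
     "table_business_description": "One row per customer",
     "source_data_file_name": "cust.dat"},
    {"table_technical_name": "ORD_HDR", "table_business_name": "Order Header",
     "table_business_description": "Orders placed by a customer",
     "source_data_file_name": "orders.dat"},
    {"table_technical_name": "PROD_DIM", "table_business_name": "Product",
     "table_business_description": "Product reference data",
     "source_data_file_name": "prod.dat"},
]

page, total = basic_search(catalog, "customer", page=1, page_size=1)
print(total, [r["table_technical_name"] for r in page])
```

The caller never names a physical column, which is the layer of abstraction between the User Interface and the underlying data elements that the portal is expected to provide.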

1.6.1 User Groups and Associated Usage


This section captures the details of the target user groups who would need access to the portal and their
associated usage of the portal, in each of the environments

1.6.2 Metadata Analysis & Usage


The Metadata Repository portal supports the following types of analysis and usage of the metadata captured.
Lineage Analysis
Lineage analysis is one of the key requirements for the proposed Metadata Management solution. The metadata
captured should support the following types of lineage analysis:
• For structured data extracted from Source, the metadata in the Data Warehouse Metadata Repository
should support bi-directional lineage analysis from the tables in Source / Warehouse to the Data Warehouse
or any downstream applications from the Data Warehouse
o The metadata should support lineage analysis at table and column level
o For each of the tables / files from Source, the System of Record information for the original
source feed may be made available as additional information. However, the lineage from the
original source data feed to the Source files / tables will be out of scope for lineage analysis
o The completeness of lineage metadata will be dependent on the process implemented for
capturing the metadata for downstream processes / applications
• For semi-structured or unstructured data sources, the metadata captured should support lineage analysis
as follows
o Bi-directional lineage analysis at object level (web files, video files etc.)
o For data sources like IVR where each transaction can potentially contain an audio file, lineage
analysis should capture the linkage of audio files to the transaction and the source feed
o For structured metadata captured as part of unstructured data sources, the metadata should
support lineage analysis at column (data element) level
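
Bi-directional lineage at column level can be modelled as a directed graph that is walked in either direction. The sketch below assumes lineage is stored as simple (source element, target element) pairs; the edge list, element names and the function `trace` are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical column-level lineage edges: (source element, target element).
EDGES = [
    ("src.cust_feed.cust_id", "dw.customer.customer_id"),
    ("dw.customer.customer_id", "mart.sales.customer_key"),
    ("src.ord_feed.cust_id", "mart.sales.customer_key"),
    ("mart.sales.customer_key", "bi.sales_report.customer"),
]

downstream = defaultdict(set)  # element -> elements it feeds
upstream = defaultdict(set)    # element -> elements it is derived from
for src, tgt in EDGES:
    downstream[src].add(tgt)
    upstream[tgt].add(src)


def trace(element, direction):
    """Breadth-first walk returning every element reachable from `element`."""
    graph = downstream if direction == "forward" else upstream
    seen, queue = set(), deque([element])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(trace("dw.customer.customer_id", "forward")))   # impact analysis
print(sorted(trace("mart.sales.customer_key", "backward")))  # origin analysis
```

A gap in the captured edges simply truncates the walk, which is why lineage completeness depends on the metadata captured for downstream processes.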
Data Usage Analysis
Data usage analysis primarily provides the ability to track what data within the Warehouse is being used, the
frequency of usage and the access log of end-users accessing the data. Data usage analysis helps in identifying how
frequently data elements are accessed, improving the data modelling and restructuring the data to provide easier
and quicker access to end-users.
Data usage analysis requires the Data Usage related operational metadata to be captured as part of the metadata
management process. Some of this operational metadata for structured data can be captured through
automated processes, either from the system logs or system tables. However, for semi-structured or unstructured
data, capturing operational metadata may require some level of tracking at the operating system level and is
subject to feasibility, specific use case requirements and the decision to implement tracking of user activity at such
a detailed level.

BI Usage Analysis
Operational Metadata required for supporting BI usage analysis will be primarily sourced from the application
metadata repositories. BI usage analysis helps to understand the user behaviour on BI tools and applications,
thus identifying potential opportunities for redesign and / or optimization.
Following are some examples of analyses typically performed on BI Usage:
• Number of users executing reports on a daily / weekly basis
• Average number of reports executed on a daily / weekly basis
• Number of times a report is run in the last x days
Audit Analysis
Audit analysis requires Operational Metadata to be captured for the data integration and load processes. Audit
analysis primarily helps to understand the effectiveness of the data movement and data loading processes and
helps to identify potential opportunities for redesign and / or optimization.
Examples of audit analysis reports are as follows:
• Average execution times for batch processes, by subject area
• Long running jobs at the potential risk of missing data loading SLAs (for proactive tuning)
• Jobs exceeding the average execution times on a daily / weekly basis
• Average number of errors or exceptions on a periodic basis
• Frequently occurring errors or exceptions by Source Feed or Subject Area
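
Two of the audit reports listed above (average execution time by subject area, and jobs exceeding their average execution time) can be derived directly from job-level Operational Metadata. The sketch below uses hypothetical in-memory run records, listed oldest first per job, in place of a repository query.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job runs: (job_name, subject_area, execution_minutes), oldest first.
RUNS = [
    ("load_customer", "Customer", 10),
    ("load_customer", "Customer", 12),
    ("load_customer", "Customer", 20),
    ("load_orders", "Sales", 30),
    ("load_orders", "Sales", 34),
    ("load_orders", "Sales", 29),
]

by_job = defaultdict(list)
by_area = defaultdict(list)
for job, area, minutes in RUNS:
    by_job[job].append(minutes)
    by_area[area].append(minutes)

# Report 1: average execution time by subject area.
avg_by_area = {area: mean(values) for area, values in by_area.items()}

# Report 2: jobs whose latest run exceeded that job's own average.
latest_over_avg = [job for job, values in by_job.items() if values[-1] > mean(values)]

print(avg_by_area)
print(latest_over_avg)
```

Comparing each job against its own history, rather than a global figure, keeps naturally long jobs from being flagged on every run.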
1.7 Metadata Maintenance and Retirement
The Metadata Maintenance and Retirement process will be closely related to and dependent on the Governance
and Planning for Metadata. For the Warehouse, the Metadata Maintenance and Retirement strategy needs to cater to
the differences in target audience, data movement strategy and data retention strategy for each of these
environments.
Following are the general guidelines for Metadata Maintenance and Retirement:
• Metadata will be captured only for the Shared Area
• No metadata will be captured or maintained for user specific directories (Private Area)
• Metadata capture and updates for any metadata captured using a manual or custom process need to be
enforced as part of the code release checklist and should be up-to-date at any given time
• Technical metadata captured using an automated process should also be maintained completely and
accurately for all objects
• The following metadata captured using an automated process may be refreshed on a need basis
o Operational Metadata
o System Statistics
• When data is purged, all metadata associated with that data / data objects should also be purged from
the metadata repository
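
One way to make sure that purging a data object also purges its metadata is to let the repository enforce it through cascading deletes. The sketch below assumes a relational metadata repository and uses an in-memory SQLite database as a stand-in; the table names `data_object` and `column_metadata` and the function `purge_object_metadata` are hypothetical.

```python
import sqlite3

repo = sqlite3.connect(":memory:")  # stand-in for the metadata repository
repo.execute("PRAGMA foreign_keys = ON")  # SQLite enforces cascades only when enabled
repo.execute("CREATE TABLE data_object (object_id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
repo.execute(
    "CREATE TABLE column_metadata ("
    "object_id INTEGER NOT NULL REFERENCES data_object(object_id) ON DELETE CASCADE, "
    "column_name TEXT)"
)
repo.executemany("INSERT INTO data_object VALUES (?, ?)", [(1, "customer"), (2, "orders")])
repo.executemany(
    "INSERT INTO column_metadata VALUES (?, ?)",
    [(1, "customer_id"), (1, "name"), (2, "order_id")],
)
repo.commit()


def purge_object_metadata(name: str) -> None:
    """Delete a data object's entry; dependent metadata rows are removed by the cascade."""
    with repo:  # commits on success, rolls back on error
        repo.execute("DELETE FROM data_object WHERE name = ?", (name,))


purge_object_metadata("customer")
remaining = repo.execute("SELECT column_name FROM column_metadata").fetchall()
print(remaining)
```

Where rules are kept in Excel files rather than in the repository database, an equivalent clean-up step would need to be part of the purge procedure itself.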