
Hitachi NEXT 2018

Automating Onboarding Data with Metadata Injection

Contents
Introduction to Metadata Injection
Guided Demonstration Overview: Metadata Injection
Guided Demonstration – Standard Metadata Injection
Guided Demonstration – Push / Pull Metadata Injection
Guided Demonstration – 2-Phase Metadata Injection
Summary of Metadata Architectures

Introduction to Metadata Injection
Metadata is traditionally defined and configured at design time, in a process known as hard-coding,
because it does not change at run time. This static ETL approach is a good one to take when you are
onboarding just one or two data sources where you can easily enter metadata manually for your
transformation. However, this hard-coding approach presents some complications, including:

• Time consumption
• Repetitive manual tasks
• Error-prone solutions
• High labour costs of designing, developing, and supporting a fragile solution
• Added risk when predictable outcomes are jeopardized
Metadata injection is the dynamic ETL alternative for scaling robust applications in an agile environment.
One transformation can service many needs by building a framework that shifts time and resources to
runtime decisions. This operation dramatically reduces upfront time-to-value and flattens the ongoing
investment in maintenance. When you are dealing with many data sources that have varying schemas,
try metadata injection to drastically reduce your development time and accelerate your time to value.

HITACHI is a trademark or registered trademark of Hitachi, Ltd. 2


Data integration is the main domain of metadata injection. As illustrated below, metadata injection is
useful in cases that face one or more of the following challenges:

• Many data sources
• Different naming conventions
• Similar content
• Dissimilar structure
• Common destination

The ETL Metadata Injection step can be used in transformations to inject metadata into another
transformation, normally with input and output steps for standardizing filenames, naming or renaming
fields, removing fields, and adding fields.
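
The idea of renaming, removing, and adding fields from metadata can be sketched outside PDI. The following Python fragment is only an illustration, not PDI code: the function name, the field names, and the shape of the metadata dictionaries are all assumptions made for the example. It shows one "template" function serving two sources that have different naming conventions but a common destination structure.

```python
def apply_field_metadata(rows, metadata):
    """Rename, remove, and add fields on each row according to the metadata."""
    rename = metadata.get("rename", {})
    remove = set(metadata.get("remove", []))
    add = metadata.get("add", {})
    result = []
    for row in rows:
        out = {}
        for name, value in row.items():
            if name in remove:
                continue  # field is dropped
            out[rename.get(name, name)] = value  # renamed if a rule exists
        out.update(add)  # constant fields added to every row
        result.append(out)
    return result


# Two hypothetical sources with different naming conventions.
crm_rows = [{"CustID": 1, "FullName": "Ada", "tmp": "x"}]
erp_rows = [{"customer_no": 2, "cust_name": "Grace"}]

crm_meta = {
    "rename": {"CustID": "customer_id", "FullName": "name"},
    "remove": ["tmp"],
    "add": {"source": "crm"},
}
erp_meta = {
    "rename": {"customer_no": "customer_id", "cust_name": "name"},
    "add": {"source": "erp"},
}

# The same function handles both sources; only the metadata differs.
standardized = (
    apply_field_metadata(crm_rows, crm_meta)
    + apply_field_metadata(erp_rows, erp_meta)
)
```

Adding a third source in this sketch would mean writing a new metadata dictionary, not a new function.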

Note: Pentaho’s metadata injection helps you accelerate productivity and reduce risk in complex data
onboarding projects by dynamically scaling out from one template to many transformations.



Pentaho Data Integration (PDI) now has over 75 steps that can be templated to inject metadata or
characteristics that can make small or large value changes, allowing each run to be different from the
previous.

https://help.pentaho.com/Documentation/8.1/Products/Data_Integration/Transformation_Step_Reference/ETL_Metadata_Injection/Steps_Supporting_MDI

ETL integration development takes time for gathering requirements, building, testing, documenting,
deploying, and monitoring production. Rules, requirements, and the data itself may change over time. If
that happens, the current rules may no longer apply or new rules may need to be added to the existing
transformation to continue working. We recommend using flexible, data-driven ETL patterns to make
your data integration transformation powerful and adaptable to changing business rules without going
through a development cycle.

Data Streaming
Since version 5.1, the ETL Metadata Injection step is capable of streaming data from one transformation into another.

To pass data from your template transformation (after injection, during execution) to your current
transformation, specify the Template step to read from. You can also specify the expected output fields,
which makes it easier to design the steps that come after the ETL Metadata Injection step.

To pass data from a source step into the template transformation (again, after injection) you can specify
Streaming source step and Streaming target step in the template transformation.

Metadata injection refers to the dynamic passing of metadata to PDI transformations at run time to
control complex data integration logic. The metadata (from the data source, a user defined file, or an
end user request) can be injected on the fly into a transformation template, providing the “instructions”
to generate actual transformations. This enables teams to drive hundreds of data ingestion and
preparation processes through just a few actual transformations, heavily accelerating time to data
insights and monetization. In data onboarding use cases, metadata injection reduces development time
and resources required, accelerating time to value. At the same time, the risk of human error is reduced.



Data integration can be made more flexible and reactive by building rules that can be injected into the
transformation before running, and by using the appropriate parameters to pass into ETL jobs. For
example:

• Passing in different filenames (paths and filenames can be different for each run)
• Passing different values into a custom structured query language (SQL) statement to
allow for different behaviours (for example, different table names, or different field names and
values in the WHERE clause)
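
As a rough illustration of the second point, here is a minimal Python sketch using the standard sqlite3 module rather than PDI. The table names, column names, and the fetch_rows helper are hypothetical. Values are passed as bound parameters; a table name cannot be bound in that way, so the sketch checks it against a fixed list before building the statement.

```python
import sqlite3

ALLOWED_TABLES = {"sales_eu", "sales_us"}  # hypothetical table names


def fetch_rows(conn, table, region):
    """Run the same query against a different table and WHERE value per call."""
    # Identifiers cannot be bound as parameters, so validate them explicitly.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table: {table}")
    query = f"SELECT id, amount FROM {table} WHERE region = ? ORDER BY id"
    return conn.execute(query, (region,)).fetchall()


# In-memory demo database with two similarly structured tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_eu (id INTEGER, region TEXT, amount REAL)")
conn.execute("CREATE TABLE sales_us (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_eu VALUES (?, ?, ?)",
    [(1, "north", 10.0), (2, "south", 20.0)],
)
conn.executemany("INSERT INTO sales_us VALUES (?, ?, ?)", [(3, "north", 30.0)])

eu_north = fetch_rows(conn, "sales_eu", "north")
us_north = fetch_rows(conn, "sales_us", "north")
```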

ETL Metadata Injection Step

The ETL Metadata Injection step exposes the metadata properties of your ‘template’ steps. This step
enables you to map existing metadata properties to new injected metadata properties.
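
Conceptually, the mapping can be pictured as filling in empty properties of a template from fields of a metadata row. The Python sketch below is an analogy only and does not reflect how PDI stores transformations: the TEMPLATE dictionary, the MAPPING list, the DELIMITER property name, and the inject function are assumptions made for illustration.

```python
import copy

# A hypothetical template: step name -> properties that are still empty.
TEMPLATE = {
    "CSV file input": {"FILENAME": None, "DELIMITER": None},
}

# Mapping rows: (target step, target property, source field in the metadata row).
MAPPING = [
    ("CSV file input", "FILENAME", "filename"),
    ("CSV file input", "DELIMITER", "delimiter"),
]


def inject(template, mapping, metadata_row):
    """Return a copy of the template with the mapped properties filled in."""
    injected = copy.deepcopy(template)  # the template itself stays untouched
    for step, prop, source_field in mapping:
        if step not in injected or prop not in injected[step]:
            raise KeyError(f"template does not expose {step}.{prop}")
        injected[step][prop] = metadata_row[source_field]
    return injected


metadata_row = {"filename": "sales_data.txt", "delimiter": ";"}
injected = inject(TEMPLATE, MAPPING, metadata_row)
```

Because the template is copied before injection, the same template can be reused with a different metadata row on the next run.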



OPTION / DESCRIPTION

Transformation template: In this section of the dialog, you can specify the transformation to use as a
template. When you have specified a transformation, you can use the Validate and Refresh button. The
Edit button will open the specified template in a new tab in Spoon.

Template step to read from (optional): If you specify a step from the template here, then the output of
the ETL Metadata Injection step will be the output from the specified step.

Optional target file (KTR after injection): For debugging or transformation generation, you can save the
resulting transformation, after metadata injection, to a file. If you want, you can specify a file name,
result.ktr for example.

Don't execute resulting transformation: If you prefer not to execute the resulting transformation (after
metadata injection), enable this option.

Field mapping: You can select any row in the metadata tree table with your mouse, which pops up a
source step and field selection dialog.



Guided Demonstration Overview: Metadata Injection
Introduction: These Guided Demonstrations outline the ‘use case’ for Metadata Injection.
Onboarding data workflows follow repeatable patterns, with just different
metadata properties.

• Scenario 1 - Hard coded delimiter


• Scenario 2 - ETL Metadata Injection of metadata

Objectives: Once the repeatable pattern has been defined in a template, the ETL Metadata
Injection step exposes the template's metadata properties, which can then be mapped
to the corresponding injected source stream fields.

• Outline the workflow for standard data onboarding.


• Configure an ETL Metadata Injection Transformation, and Template.

Scenario 1 – Static ETL


In this scenario, onboarding the files would require a CSV file input step for each of the different
delimiters.



1. Double-click on the CSV File Input steps to display the metadata properties.

Each datasource requires its own Transformation.



Scenario 2 – Inject ETL Metadata Properties
Template

For this scenario, the onboarding of the data is achieved with a template.

Note: The steps in the template define the scope of the metadata injection properties.

1. Double-click on each of the steps:


• CSV file input
• Select values
The metadata properties will be injected at run time.

2. Double-click on the table output step.
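
The difference between the two scenarios can be illustrated with a short Python sketch using the standard csv module. This is an analogy, not PDI behaviour: the read_delimited function and the sample data are hypothetical. Instead of one reader per delimiter, a single reader receives the delimiter as a run-time value.

```python
import csv
import io


def read_delimited(text, delimiter):
    """One reader for every source; the delimiter is supplied at run time."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Hypothetical sources: same structure, different delimiters.
sources = [
    ("id,amount\n1,10\n", ","),
    ("id;amount\n2;20\n", ";"),
    ("id|amount\n3|30\n", "|"),
]

rows = [
    row
    for text, delimiter in sources
    for row in read_delimited(text, delimiter)
]
```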



ETL Metadata Injection Transformation

The main Transformation:

• Onboards the sales_data.txt into the datastream pipeline


• Injects the required metadata properties from the Data Grid steps
• Maps the Template fields to the Injected Metadata properties
• Executes the Template



ETL Metadata Injection Step

1. Double-click on the ETL Metadata Injection step


The step is mapped to: tr_metadata_inject_template.ktr

Once mapped, the steps and fields in tr_metadata_inject_template.ktr are exposed.

The Source steps and fields can now be mapped to the corresponding Target Injection Step.

In this example, the Source step: Using variable to resolve filename and stream Field: filename,
is mapped to the template: CSV file input step and FILENAME datastream field.

2. Examine the other mappings.



Dynamic ETL Metadata Injection
One of the main challenges faced with Metadata Injection is automating the process. PDI comes with
steps that can extract the required metadata properties from files, streams or database tables without
resorting to code.

File Metadata

The following steps are useful for extracting file metadata.

• Get File Names


• File Metadata - available on the Marketplace (not supported)

Stream Metadata

The following step gives you several of the metadata properties associated with the incoming stream.

• Metadata structure of stream

Database Metadata

The following step is useful for extracting database metadata.

• Get JDBC Metadata - available on the Marketplace (not supported)


In this scenario, the file metadata properties have been extracted using the File Metadata step from the
Marketplace (not supported).
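
To give a feel for what extracting file metadata means, the following Python sketch guesses a delimiter and the field names from the header line of a sample. It is a deliberately simple heuristic written for this illustration and is not how the File Metadata step works; the candidate list and the extract_file_metadata function are assumptions.

```python
CANDIDATE_DELIMITERS = [",", ";", "|", "\t"]


def extract_file_metadata(sample):
    """Guess the delimiter and field names from the header line of a sample."""
    header = sample.splitlines()[0]
    # Pick the candidate that occurs most often in the header line.
    delimiter = max(CANDIDATE_DELIMITERS, key=header.count)
    if header.count(delimiter) == 0:
        raise ValueError("no known delimiter found in header")
    return {"delimiter": delimiter, "fields": header.split(delimiter)}


metadata = extract_file_metadata("order_id;customer;total\n1;Ada;9.50\n")
```

The resulting dictionary is the kind of value that could then be injected into a template instead of being typed in by hand.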



Guided Demonstration: Standard Metadata Injection
Introduction: In this example, the data ingestion step is defined within the template
transformation workflow.

Objectives: In this guided demonstration, you will:

• Configure Metadata Injection Transformation steps


• Configure Metadata Injection Template

An effective way to learn how metadata injection works is to develop a simple application. The following
steps will guide you through creating a simple application for metadata injection:

Step 1 - Metadata Injection Template


The template is the workflow that utilizes the metadata injection.



Data Grid – Test data - input
Meta tab: On this tab, you can specify the field metadata (output specification) of the data.

Data tab: This grid contains the data. Everything is entered in String format so make sure you use the
correct format masks in the metadata tab.
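
The relationship between the two tabs can be sketched in Python: the data is a grid of strings, and the metadata says how each column should be interpreted. The field names and values below are hypothetical. PDI uses Java-style format masks (for example yyyy-MM-dd), whereas this sketch uses Python's strptime directives, so the masks are not interchangeable.

```python
from datetime import datetime

# Hypothetical Meta tab: field name, type, and an optional format mask.
META = [
    {"name": "id", "type": "Integer", "format": None},
    {"name": "order_date", "type": "Date", "format": "%Y-%m-%d"},
    {"name": "customer", "type": "String", "format": None},
]

# Hypothetical Data tab: every value is entered as a string.
DATA = [
    ["1", "2018-09-25", "Ada"],
    ["2", "2018-09-26", "Grace"],
]


def convert(value, field):
    """Convert one string value according to its field metadata."""
    if field["type"] == "Integer":
        return int(value)
    if field["type"] == "Date":
        return datetime.strptime(value, field["format"]).date()
    return value  # String: keep as entered


typed_rows = [
    {field["name"]: convert(value, field) for field, value in zip(META, row)}
    for row in DATA
]
```

If the mask does not match the string in the data, strptime raises a ValueError, which mirrors why the format masks need to agree with the data as entered.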

1. Drag and drop the Data Grid step onto the canvas.
2. Double-click to set the properties as outlined below:

3. For the test data, click on the Data tab.

This is the data ingestion step. It could be a table, flat file, or others.



Select Values

To configure the Select values step:

1. Drag the Select values step onto the canvas


There’s nothing to configure as the ‘metadata rules’ will be defined in the ETL Metadata Injection
step.

Text File Output


The Text file output step is used to export data to text file format. This is commonly used to generate
comma-separated values (CSV) files that can be read by spreadsheet applications. It is also possible to
generate fixed-width files by setting lengths on the fields in the Fields tab.

1. Drag and drop the Text file output step onto the canvas.
2. Double-click to set the properties as outlined below:

• Just add the path to the output file. Notice the internal variables used to define the filename.
Filename:
${Internal.Entry.Current.Directory}/${Internal.Transformation.Name}_output
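
How a filename built from variables resolves can be sketched in a few lines of Python. The resolve_variables function and the variable values are hypothetical; in PDI the internal variables are set by the tool itself at run time.

```python
import re


def resolve_variables(text, variables):
    """Replace each ${name} with its value; unknown names raise KeyError."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: variables[m.group(1)], text)


# Hypothetical values for the two internal variables used in the filename.
variables = {
    "Internal.Entry.Current.Directory": "/demo",
    "Internal.Transformation.Name": "tr_standard_mdi",
}

filename = resolve_variables(
    "${Internal.Entry.Current.Directory}/${Internal.Transformation.Name}_output",
    variables,
)
```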



3. Save the Transformation as: tr_standard_template.ktr.

Step 2 – ETL Metadata Transformation


The Transformation sets the metadata fieldname values that are going to be used in the Metadata
Injection Template.

Data Grid

To configure the Data Grid step:

1. Drag and drop the Data Grid step onto the canvas.
2. Double-click to set the properties as outlined below:



This introduces two fieldnames into the datastream:

• fieldname
• mdi_fieldname
3. Click on the Data tab:

Each stream field is associated with either integer or string values.



ETL Metadata Injection
The ETL Metadata Injection step inserts metadata into a template transformation. Instead of statically
entering ETL Metadata in a step dialog, you pass it at run-time. This step enables you to solve repetitive
ETL workloads like loading of text files, data migration, and so on.

1. Drag and drop the ETL Metadata Injection step onto the canvas.
2. Double-click to set the properties as outlined below:

3. Click Browse to locate the Metadata Injection Template.


4. Click on the Inject Metadata tab:
These options define the ‘metadata rules’ for each step in the template. In this example, the Select
values step will change the ‘fieldname’ to ‘mdi_fieldname’ in the meta tab option.

5. Save the Transformation as: tr_standard_mdi.ktr.



Step 3 - RUN the MDI Workflow

1. Run tr_standard_mdi.ktr.
2. Open the file located at:
C:\NEXT-2018\NEXT – Automating Data Onboarding with Metadata Injection\1-Guided Demo – Standard MDI\tr_standard_mdi_output.txt



Guided Demonstration: Push / Pull Metadata Injection
Introduction: The guided demonstration illustrates 3 modes of Metadata Injection.

Objectives: In this guided demonstration, you will configure Push - Pull Metadata Injection.

Step 1 - Template

The template is the workflow that leverages metadata injection.

Dummy – Input Stream

1. Drag the Dummy step onto the canvas.


2. Rename it to Input stream.

Select Values

To configure the Select values step:

1. Drag the Select values step onto the canvas.


There’s nothing to configure as the ‘metadata rules’ will be defined in the ETL Metadata Injection step.

Dummy - Result

1. Drag the Dummy step onto the canvas.


2. Rename it to Results.
3. Save the Transformation as: tr_push_pull_template.ktr.



Step 2 – ETL Metadata Transformation

As you may have guessed, the Push – Pull workflow is a combination of the previous workflows.

Data Grid – Test data - input

To configure the Data Grid step:

1. Drag and drop the Data Grid step onto the canvas and name it Test data - input.
2. Double-click to set the properties as outlined below:



3. For the test data, click on the Data tab.

This is the data ingestion step. It could be a table, flat file, or others.

Data Grid - Metadata

To configure the Metadata Data Grid step:

1. Drag and drop the Data Grid step onto the canvas.
2. Double-click to set the properties as outlined below:



This introduces two fieldnames into the datastream:

• fieldname
• mdi_fieldname

3. Click on the Data tab:

Each stream field is associated with either integer or string values.



ETL Metadata Injection

1. Drag and drop the ETL Metadata Injection step onto the canvas.
2. Double-click to set the properties as outlined below:

• Picks up the Metadata Injection Template.


• These options define the ‘metadata rules’ for each step in the template. In this example,
the Select values step will change the ‘fieldname’ to ‘mdi_fieldname’ in the meta tab option.
3. Click on the Options tab:



The ETL Metadata Injection step reads (pulls) streamed data from the Result step of the
template transformation.

The data is streamed (pushed) from the Test data - input step of the MDI workflow to the Input
stream step of the template.
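
The push and pull directions can be pictured with Python generators. This is an analogy only: the function names and the mdi_ prefixed target names are hypothetical, and PDI streams rows between steps in its own way. The main workflow supplies rows, the template consumes them one at a time, and the main workflow reads the template's output back.

```python
def template(input_stream, rename):
    """Template: Input stream -> Select values (rename) -> Result."""
    for row in input_stream:
        yield {rename.get(name, name): value for name, value in row.items()}


def test_data():
    """Main workflow: the rows that will be pushed into the template."""
    yield {"integer": 1, "string": "a"}
    yield {"integer": 2, "string": "b"}


# Hypothetical injected metadata for the Select values step.
rename = {"integer": "mdi_integer", "string": "mdi_string"}

# Push rows into the template and pull the results back, row by row.
pulled = list(template(test_data(), rename))
```

Because generators are lazy, each row travels through the template before the next one is produced, which is a reasonable mental model for streaming.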

Text File Output

1. Drag and drop the Text file output step onto the canvas.
2. Double-click to set the properties as outlined below:



Just add the path to the output file. Notice the internal variables used to define the filename.
Filename:
${Internal.Entry.Current.Directory}/${Internal.Transformation.Name}_output

3. Save the Transformation as: tr_push_pull_mdi.ktr.

Step 3 - Run the MDI Workflow

1. Run tr_push_pull_mdi.ktr.
2. Open the file located at:
C:\NEXT-2018\NEXT – Automating Data Onboarding with Metadata Injection\2-Guided Demo – Push - Pull MDI\tr_push_pull_mdi_output.txt



Guided Demonstration: 2-Phase Metadata Injection
Introduction: In Phase 1 of 2-Phase Metadata Injection, the metadata values are injected into the
template, which outputs an _injected.ktr to the path set in the Options tab. The original
metadata values have now been transferred, or mapped, to the _injected.ktr.

In Phase 2, the _injected.ktr is simply run.
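
The separation into two phases can be sketched in Python. In this illustration a JSON file stands in for the _injected.ktr (a real .ktr file is XML), and the phase_1 and phase_2 functions, the dictionary layout, and the mdi_ prefixed names are all assumptions.

```python
import json
import os
import tempfile


def phase_1(template, metadata, target_path):
    """Inject metadata into the template and save the result without running it."""
    injected = dict(template, **metadata)
    with open(target_path, "w", encoding="utf-8") as handle:
        json.dump(injected, handle)


def phase_2(injected_path, rows):
    """Load the saved, already injected definition and run it on the rows."""
    with open(injected_path, encoding="utf-8") as handle:
        definition = json.load(handle)
    rename = definition["rename"]
    return [{rename.get(k, k): v for k, v in row.items()} for row in rows]


template = {"name": "tr_2_phase_template", "rename": {}}
metadata = {"rename": {"integer": "mdi_integer", "string": "mdi_string"}}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "tr_2_phase_mdi_injected.json")
    phase_1(template, metadata, path)  # Phase 1: store the injected definition
    result = phase_2(path, [{"integer": 1, "string": "a"}])  # Phase 2: run it
```

Phase 2 never sees the original metadata; it only needs the saved file, which is the point of splitting the work.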

Objectives: In this guided demonstration, you will:

• Configure 1st Phase workflow to ‘store’ metadata injection values


• RUN second Phase workflow

Phase 1

Step 1 - Template

The template explicitly renames the fields to the new mdi fields, in the Select values step.

Data Grid - Test data - Input

1. Drag and drop the Data Grid step onto the canvas.
2. Double-click to set the properties as outlined below:



This introduces two fieldnames into the datastream:

• integer
• string

3. Click on the Data tab:

Select values

1. Drag the Select values step onto the canvas.


There’s nothing to configure as the metadata values are injected into the step.



Text File Output
1. Drag and drop the Text file output step onto the canvas.
2. Double-click to set the properties as outlined below:

Filename:
${Internal.Entry.Current.Directory}/${Internal.Transformation.Name}_output_${sequence}

3. Save the Transformation as: tr_2_phase_template.ktr.

Step 2 – ETL Metadata Transformation

The Transformation template is output as a populated template_injected.ktr



Data Grid - Metadata

To configure the Data Grid step:

1. Drag and drop the Data Grid step onto the canvas.
2. Double-click to set the properties as outlined below:

3. Click on the Data tab:

Each stream field is associated with either integer or string values.



ETL Metadata Injection

1. Drag and drop the ETL Metadata Injection step onto the canvas.
2. Double-click to set the properties as outlined below:

Picks up the Metadata Injection Template.

3. Click on the Options tab:

• The output is the _injected.ktr.


• The _injected.ktr becomes the template for executing the Phase 2 Transformation.



Optional target file (ktr after injection):
${Internal.Entry.Current.Directory}/output/${Internal.Transformation.Name}_injected.ktr

4. Save the Transformation as: tr_2_phase_mdi.ktr.

Phase 2

In Phase 2, a Transformation Executor runs the Transformation, referencing the _injected.ktr as the
template.

Generate Rows

This step is used only to generate 5 rows.

Add sequence

The Add sequence step adds a sequence to the stream. A sequence is an ever-changing integer value
with a specific start and increment value. You can either use a database sequence to determine the
value of the sequence, or have it generated by Kettle.

Kettle sequences are unique only when used in the same transformation. Also, they are not stored, so
the values start back at the same value every time the transformation is launched.
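
The behaviour of a non-stored, in-memory sequence can be imitated in Python with itertools.count. This is a sketch of the concept, not of Kettle's implementation, and make_sequence is a hypothetical helper. Each launch creates a fresh counter, so the values start again from the same start value.

```python
from itertools import count, islice


def make_sequence(start=1, increment=1):
    """Return a fresh in-memory counter with a start and increment value."""
    return count(start, increment)


# Two separate "launches": each gets its own counter, nothing is stored.
first_run = list(islice(make_sequence(start=1, increment=1), 5))
second_run = list(islice(make_sequence(start=1, increment=1), 5))
```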

1. Drag and drop the Get value from Sequence step onto the canvas.
2. Double-click to set the properties as outlined below:

Transformation Executor

The Transformation Executor step allows you to execute a Pentaho Data Integration (PDI)
transformation. By default, the specified transformation will be executed once for each input row.
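
The once-per-row behaviour can be sketched in Python as a loop that calls a function for every input row and passes the row's values in as variables. The function names, the name field, and the .txt extension are assumptions made for this illustration.

```python
def run_injected(variables):
    """Stand-in for the injected transformation; returns its output filename."""
    return "{name}_output_{sequence}.txt".format(**variables)


def transformation_executor(rows, transformation):
    """Execute the transformation once per input row, passing the row as variables."""
    return [transformation(row) for row in rows]


# Five generated rows, each carrying a different sequence value.
rows = [{"name": "tr_2_phase_template", "sequence": n} for n in range(1, 6)]
outputs = transformation_executor(rows, run_injected)
```

Since the sequence value is part of each filename, the five executions do not overwrite each other's output.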

1. Drag and drop the Transformation Executor step onto the canvas.
2. Double-click to set the properties as outlined below:

• Ensure the step is pointing to the _injected.ktr and the sequence variable has been set.
• The sequence is set as a variable to distinguish the records on output.
3. Click on the Row Grouping tab. Notice that the rows are sent one by one to the Transformation.



4. Save the Transformation as: tr_2_phase_2_mdi.ktr.

Step 3 - RUN the MDI Workflow

1. Run tr_2_phase_mdi.ktr.
2. Check that tr_2_phase_mdi_injected.ktr has been created.
3. Run tr_2_phase_2_mdi.ktr.
4. The Transformation Executor step will run 5 times, once for each generated record.
5. Open the output files:
tr_2_phase_template_output_1.txt to tr_2_phase_template_output_5.txt

Logging Results



Summary of Metadata Architectures
In general, an MDI process follows one of three architectures:

1. MDI Standard: metadata is simply injected and the template transformation is called.
2. MDI Data Flow: additionally, data is pushed (streamed) from the main transformation to the template
transformation and pulled back again. The use case is a template that processes dynamic data from the
main transformation.
3. MDI 2 Phase Processing: for big data use cases. In Phase 1, the template transformation and the
metadata define an _injected.ktr. In Phase 2, the _injected.ktr is used as the template for the
Transformation. This architecture is used in our Onboarding Blueprint for Big Data: Filling the Data Lake.
