Contents
Introduction to Metadata Injection
Guided Demonstration Overview: Metadata Injection
Guided Demonstration – Standard Metadata Injection
Guided Demonstration – Push/Pull Metadata Injection
Guided Demonstration – 2-Phase Metadata Injection
Summary of Metadata Architectures
Introduction to Metadata Injection
Metadata is traditionally defined and configured at design time and does not change at run time, an
approach known as hard-coding. This static ETL approach works well when you are onboarding just one
or two data sources and can easily enter the metadata for your transformation manually. However, the
hard-coding approach presents some complications, including:
• Time consumption
• Repetitive manual tasks
• Error-prone solutions
• High labour costs of designing, developing, and supporting a fragile solution
• Added risk, as fragile solutions jeopardize otherwise predictable outcomes
Metadata injection is the dynamic ETL alternative for scaling robust applications in an agile environment.
One transformation can service many needs by building a framework that shifts time and resources to
run-time decisions. This approach dramatically reduces upfront time-to-value and flattens the ongoing
investment in maintenance. When you are dealing with many data sources that have varying schemas,
try metadata injection to drastically reduce your development time and accelerate your time to value.
The typical scenario involves (see the sketch after this list):
• Many data sources
• Different naming conventions
• Similar content
• Dissimilar structure
• Common destination
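Before stepping through the demonstrations, it can help to see this pattern stripped of any tool. Below is a minimal, tool-agnostic Java sketch of the idea: one generic template routine driven by per-source metadata, so a single piece of logic services many differently structured sources. Every name in it (SourceMetadata, loadRows, writeRows, the file names and field mappings) is a hypothetical illustration, not PDI API.

import java.util.List;
import java.util.Map;

public class MdiSketch {

    // The metadata for one source: where it lives and how its fields
    // map onto the common destination schema.
    record SourceMetadata(String filename, Map<String, String> fieldMapping) {}

    // The one "template": read, rename fields per the mapping, write.
    static void onboard(SourceMetadata meta) {
        for (Map<String, Object> row : loadRows(meta.filename())) {
            Map<String, Object> out = new java.util.LinkedHashMap<>();
            meta.fieldMapping().forEach((src, dest) -> out.put(dest, row.get(src)));
            writeRows(out);
        }
    }

    public static void main(String[] args) {
        // The same template handles sources with different naming conventions.
        List<SourceMetadata> sources = List.of(
                new SourceMetadata("customers_eu.csv", Map.of("cust_nm", "customer_name")),
                new SourceMetadata("clients_us.csv", Map.of("client", "customer_name")));
        sources.forEach(MdiSketch::onboard);
    }

    // Stubs standing in for real input and output steps.
    static List<Map<String, Object>> loadRows(String filename) { return List.of(Map.of()); }
    static void writeRows(Map<String, Object> row) { /* e.g. append to a common table */ }
}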
The ETL Metadata Injection step can be used in transformations to inject metadata into another
transformation, normally with input and output steps for standardizing filenames, naming or renaming
fields, removing fields, and adding fields.
Note: Pentaho’s metadata injection helps you accelerate productivity and reduce risk in complex data
onboarding projects by dynamically scaling out from one template to many transformations.
https://help.pentaho.com/Documentation/8.1/Products/Data_Integration/Transformation_Step_Reference/ETL_Metadata_Injection/Steps_Supporting_MDI
ETL integration development takes time for gathering requirements, building, testing, documenting,
deploying, and monitoring production. Rules, requirements, and the data itself may change over time. If
that happens, the current rules may no longer apply, or new rules may need to be added for the existing
transformation to continue working. We recommend using flexible, data-driven ETL patterns to make
your data integration transformations powerful and adaptable to changing business rules without going
through a development cycle.
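As a small illustration of a data-driven pattern, the Java sketch below reads a business rule from an external file at run time, so changing the rule means editing data, not redeveloping the transformation. The file name and property key are hypothetical.

import java.io.FileReader;
import java.util.Properties;

public class RuleDrivenFilter {
    public static void main(String[] args) throws Exception {
        // The rule lives outside the compiled/designed logic,
        // e.g. onboarding_rules.properties contains: min_order_value=100
        Properties rules = new Properties();
        rules.load(new FileReader("onboarding_rules.properties"));
        double minOrder = Double.parseDouble(rules.getProperty("min_order_value", "0"));

        // The same logic adapts to a new threshold with no code change.
        for (double order : new double[] {42.0, 150.0, 99.9}) {
            if (order >= minOrder) {
                System.out.println("accept " + order);
            }
        }
    }
}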
Data Streaming
Since version 5.1, this step is capable of streaming data from one transformation into another.
To pass data from your template transformation (after injection, during execution) to your current
transformation, specify Template step to read from. You can also specify the expected output fields to
easily design the steps that come after the ETL Metadata Injection step.
To pass data from a source step into the template transformation (again, after injection) you can specify
Streaming source step and Streaming target step in the template transformation.
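The push/pull flow can be pictured as two queues between the current transformation and the template. The following tool-agnostic Java sketch mimics it with blocking queues: the driver pushes rows to the template's streaming target, the template transforms them, and the driver pulls results back from the step it reads from. The queue names and the upper-casing transform are illustrative, not PDI API.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PushPullSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> toTemplate = new ArrayBlockingQueue<>(16);   // "streaming target"
        BlockingQueue<String> fromTemplate = new ArrayBlockingQueue<>(16); // "template step to read from"

        // The "template": consume pushed rows, emit transformed rows.
        Thread template = new Thread(() -> {
            try {
                String row;
                while (!(row = toTemplate.take()).equals("EOF")) {
                    fromTemplate.put(row.toUpperCase()); // stand-in for the template's logic
                }
                fromTemplate.put("EOF");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        template.start();

        // The driver: push source rows, then pull results for downstream steps.
        for (String row : new String[] {"alpha", "beta", "EOF"}) {
            toTemplate.put(row);
        }
        String out;
        while (!(out = fromTemplate.take()).equals("EOF")) {
            System.out.println(out);
        }
        template.join();
    }
}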
Metadata injection refers to the dynamic passing of metadata to PDI transformations at run time to
control complex data integration logic. The metadata (from the data source, a user-defined file, or an
end-user request) can be injected on the fly into a transformation template, providing the “instructions”
to generate actual transformations. This enables teams to drive hundreds of data ingestion and
preparation processes through just a few actual transformations, greatly accelerating time to data
insights and monetization. In data onboarding use cases, metadata injection reduces the development
time and resources required, accelerating time to value while also reducing the risk of human error.
Typical examples include (a sketch follows this list):
• Passing in different filenames (paths and filenames can be different for each run)
• Passing different values into a custom Structured Query Language (SQL) statement to allow for
different behaviours (for example, different table names and WHERE-clause field values)
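The sketch below shows this per-run parameterization using the public Kettle Java API (it assumes the kettle-engine library is on the classpath; the .ktr path and variable names are hypothetical). Each run of the same transformation can receive a different filename or WHERE-clause value.

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunWithVariables {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();

        TransMeta meta = new TransMeta("tr_template.ktr");
        Trans trans = new Trans(meta);

        // Different values can be passed on every run.
        trans.setVariable("INPUT_FILENAME", "/data/inbound/customers_2018.csv");
        trans.setVariable("WHERE_FIELD", "region");

        trans.execute(null);        // start the transformation
        trans.waitUntilFinished();  // block until all steps complete
        System.out.println("errors: " + trans.getErrors());
    }
}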
The ETL Metadata Injection step exposes the metadata properties of your ‘template’ steps. This step
enables you to map existing metadata properties to new injected metadata properties.
Objectives: Once the repeatable pattern has been defined in a template, the ETL Metadata Injection
step exposes the template steps’ metadata properties, which can then be mapped to the corresponding
injected source stream fields.
For this scenario, the onboarding of the data is achieved with a template.
Note: The steps in the template define the scope of the metadata injection properties.
The Source steps and fields can now be mapped to the corresponding Target Injection Step.
In this example, the Source step Using variable to resolve filename and its stream field filename
are mapped to the template’s CSV file input step and its FILENAME data stream field.
File Metadata
Stream Metadata
The following step gives you several of the metadata properties associated with the incoming stream.
Database Metadata
An effective way to learn how metadata injection works is to develop a simple application. The following
steps will guide you through creating a simple application for metadata injection:
Data tab: This grid contains the data. Everything is entered in String format, so make sure you use the
correct format masks on the Metadata tab.
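PDI format masks follow the standard Java conventions (java.text.SimpleDateFormat for dates, java.text.DecimalFormat for numbers), so a quick way to check a mask is to try it in plain Java. The sample values below are hypothetical.

import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class MaskDemo {
    public static void main(String[] args) throws Exception {
        // A Data Grid String "2018/05/07" with mask "yyyy/MM/dd" parses as a Date.
        SimpleDateFormat dateMask = new SimpleDateFormat("yyyy/MM/dd");
        System.out.println(dateMask.parse("2018/05/07"));

        // The String "1,234.50" with mask "#,##0.00" parses as a Number.
        DecimalFormat numberMask = new DecimalFormat("#,##0.00",
                new DecimalFormatSymbols(Locale.US));
        System.out.println(numberMask.parse("1,234.50"));
    }
}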
1. Drag and drop the Data Grid step onto the canvas.
2. Double-click to set the properties as outlined below:
This is the data ingestion step. It could be a table, flat file, or others.
1. Drag and drop the Text file output step onto the canvas.
2. Double-click to set the properties as outlined below:
• Just add the path to the output file, and notice the internal variables used to define the filename:
${Internal.Entry.Current.Directory} resolves to the directory containing the current transformation,
and ${Internal.Transformation.Name} to the transformation’s name.
Filename:
${Internal.Entry.Current.Directory}/${Internal.Transformation.Name}_output
Data Grid
1. Drag and drop the Data Grid step onto the canvas.
2. Double-click to set the properties as outlined below:
• fieldname
• mdi_fieldname
3. Click on the Data tab:
1. Drag and drop the ETL Metadata Injection step onto the canvas.
2. Double-click to set the properties as outlined below:
1. Run tr_standard_mdi.ktr.
2. Open the file located at:
C:\NEXT-2018\tr_standard_mdi_output.txt
Objectives: In this guided demonstration, you will configure Push/Pull Metadata Injection.
Step 1 - Template
Select Values
Dummy - Result
As you may have guessed, the Push/Pull workflow is a combination of the previous workflows.
1. Drag and drop the Test data - Input step onto the canvas.
2. Double-click to set the properties as outlined below:
This is the data ingestion step. It could be a table, flat file, or other source.
1. Drag and drop the Data Grid step onto the canvas.
2. Double-click to set the properties as outlined below:
Each stream field is associated with either an integer or a string value.
1. Drag and drop the ETL Metadata Injection step onto the canvas.
2. Double-click to set the properties as outlined below:
The data is streamed (pushed) from the Test data - Input step of the MDI workflow to the Input
Stream step of the template.
1. Drag and drop the Text file output step onto the canvas.
2. Double-click to set the properties as outlined below:
1. Run tr_push_pull_mdi.ktr.
2. Open the file located at:
C:\NEXT-2018\tr_push_pull_mdi_output.txt
Phase 1
Step 1 - Template
The template explicitly renames the fields to the new mdi_ field names in the Select values step.
1. Drag and drop the Data Grid step onto the canvas.
2. Double-click to set the properties as outlined below:
Select values
Filename:
${Internal.Entry.Current.Directory}/${Internal.Transformation.Name}_output_${sequence}
1. Drag and drop the Metadata Data Grid step onto the canvas.
2. Double-click to set the properties as outlined below:
Each stream field is associated with either an integer or a string value.
1. Drag and drop the ETL Metadata Injection step onto the canvas.
2. Double-click to set the properties as outlined below:
_injected.ktr
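Conceptually, Phase 1 ends with the fully resolved transformation being persisted as the _injected.ktr file (the ETL Metadata Injection step can write this file for you via its optional target file setting). As a hedged sketch of what that persistence amounts to, the Kettle Java API can serialize any transformation definition to XML; the file names below mirror this demonstration but are assumptions.

import java.io.FileWriter;
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.TransMeta;

public class PersistInjectedKtr {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        // Load an (already resolved) transformation definition.
        TransMeta meta = new TransMeta("tr_2_phase_template.ktr");
        // getXML() serializes the definition; writing it out yields a runnable .ktr.
        try (FileWriter out = new FileWriter("tr_2_phase_mdi_injected.ktr")) {
            out.write(meta.getXML());
        }
    }
}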
Phase 2
Generate Rows
Add sequence
The Add sequence step adds a sequence to the stream. A sequence is an ever-changing integer value
with a specific start and increment value. You can either use a database sequence to determine the
value of the sequence, or have it generated by Kettle.
Kettle sequences are unique only when used in the same transformation. Also, they are not stored, so
the values start back at the same value every time the transformation is launched.
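The semantics are easy to reproduce in a few lines of Java: a counter with a start and an increment that, like a Kettle sequence, is unique only within one run and restarts at the same value on every launch. The class name and values below are illustrative.

import java.util.concurrent.atomic.AtomicLong;

public class SequenceSketch {
    private final AtomicLong value;
    private final long increment;

    SequenceSketch(long start, long increment) {
        this.value = new AtomicLong(start);
        this.increment = increment;
    }

    long next() {
        return value.getAndAdd(increment); // returns the current value, then advances it
    }

    public static void main(String[] args) {
        SequenceSketch seq = new SequenceSketch(1, 1); // start = 1, increment = 1
        for (int i = 0; i < 5; i++) {
            System.out.println(seq.next()); // prints 1 2 3 4 5; restarts at 1 next launch
        }
    }
}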
1. Drag and drop the Add sequence step onto the canvas.
2. Double-click to set the properties as outlined below:
The Transformation Executor step allows you to execute a Pentaho Data Integration (PDI)
transformation. By default, the specified transformation will be executed once for each input row.
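What "executed once for each input row" amounts to can be sketched with the Kettle Java API: the same definition is loaded once and run repeatedly, each run with its own variable value. The .ktr name and the sequence variable mirror this demonstration's setup but are assumptions here.

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class ExecutePerRow {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta meta = new TransMeta("tr_2_phase_mdi_injected.ktr");

        // One execution per incoming row, each with its own sequence value.
        for (int sequence = 1; sequence <= 5; sequence++) {
            Trans trans = new Trans(meta);
            trans.setVariable("sequence", String.valueOf(sequence));
            trans.execute(null);
            trans.waitUntilFinished();
        }
    }
}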
1. Drag and drop the Transformation Executor step onto the canvas.
2. Double-click to set the properties as outlined below:
• Ensure the step is pointing to the _injected.ktr and the sequence variable has been set.
• The sequence is set as a variable to distinguish the records on output.
3. Click on the Row Grouping tab. Notice that the rows are sent one by one to the Transformation.
1. Run tr_2_phase_mdi.ktr.
2. Check that tr_2_phase_mdi_injected.ktr has been created.
3. Run tr_2_phase_2_mdi.ktr.
4. The Transformation Executor step will run 5 times, once for each generated record.
5. Open the files located at:
tr_2_phase_template_output_1.txt through tr_2_phase_template_output_5.txt
Summary of Metadata Architectures
1. Simply injecting metadata and calling the template transformation is referred to as MDI Standard.
2. Additionally, the data can be pushed (streamed) from the main transformation to the template and
conversely pulled back. This use case, in which the template processes dynamic data from the main
transformation, is referred to as MDI Data Flow.
3. For big data use cases, the template transformation with its metadata defines an _injected.ktr in
Phase 1. In Phase 2, the _injected.ktr is used as the template for the Transformation Executor. This is
referred to as MDI 2-Phase Processing, and it is used in our Onboarding Blueprint for Big Data: Filling
the Data Lake.