Pentaho Data Integration
Training Prerequisites
• Audience
• Technical users who build/maintain data models for analysis and manage BI data/metadata from various data sources. These users can be:
• Database Developers, Business Analysts, BI Architects, and Systems Integrators
• Technical
• Knowledge of SQL and relational database concepts
• System
• Any machine running Windows or Linux with at least 4 GB RAM and installation rights
• Sun Java 1.7.x JDK/JRE
• MySQL 5.x or above
What will we learn in this training?
• Basic architecture of Pentaho Data Integration (PDI)
• Interaction with different data sources
• Integrating various data sources
• Work with various transformations (steps)
• Performance Tuning (DB/PDI)
• Jobs and Transformations in detail
• Variables, Parameters and Arguments
• Building flexible Jobs and Transformations
• Logging, Monitoring and Error handling in PDI
• Use of Javascript and Java classes
Training Methodology
• 8 hours a day + 1 hour Lunch
• Couple of small breaks as and when needed
• Hands-on training with many exercises
• A small project at the end of the sessions
• Feedback from the participants at the end of the training
Pentaho Data Integration Overview
Introduction
• PDI is the product associated with the KETTLE open source project:
  • KETTLE is the open source software that makes up the core of PDI Enterprise Edition
  • PDI Enterprise Edition is production ready:
    • Professional technical support
    • Maintenance releases
    • EE-only features including enterprise security, scheduling, monitoring, and more
    • Documentation
• PDI is a member of the Pentaho BI Suite.
Pentaho Data Integration Features
• Ease of use:
• 100% metadata driven (define WHAT you want to do, not HOW to do it)
• No extra code generation means lower complexity
• Simple setup, intuitive graphical designers, and easy to maintain
• Flexibility:
• Never forces a certain path on the user
• Pluggable architecture for extending functionality
[Architecture diagram: structured and unstructured data flow through the PDI layer into data storage, which feeds dashboards, metadata, reports, and Analyzer.]
AGILE BI
• Modeling and visualization perspectives allow you to analyze data from within PDI:
  • Model perspective
  • Visualization perspective
The DI Repository
PDI can store metadata in:
• XML files
• RDBMS repository
• Enterprise repository
Objects stored in repository:
• Connections
• Transformations
• Jobs
• Schemas
• User and profile definitions
PDI Tools
• Kitchen
  • Command-line tool for executing jobs modeled in Spoon
• Carte
  • Lightweight web/HTTP server for remotely executing jobs and transformations
  • Accepts XML containing the transformation to execute and its configuration
  • Enables remote monitoring
  • Used for executing transformations in a cluster
Transformations
CSV File Input
• Due to its internal processing, this step is much faster than the Text file input step.
Microsoft Excel Output
• Export options:
  • Sheet name
  • Protect sheet with password
  • Use a template
  • Append to the content of the template
Text File Output
• Can write output in many formats, including fixed-width, CSV, etc.
• Options:
  • Extension
  • Append
  • Separator
  • Enclosure
  • Header/footer
  • Zipped
  • Include step number/date/time in file name
  • Encoding
  • Right pad fields
  • Split every 'n' rows
  • Write directly to servlet output when run on Carte or the DI server
Table Output
• Inserts rows into a database table
• Options:
  • Target table
  • Commit size
  • Truncate table
  • Ignore insert errors
  • Partition data over tables
  • Use batch update for inserts
  • Return auto-generated key
  • Name of auto-generated key field
  • Is the name of the table defined in a field?
Insert/Update
• Automates simple merge processing:
  • Looks up a row using one or more lookup keys
  • If the row is not found, inserts a new row
  • If the row is found and the target fields are identical, does nothing; otherwise, updates the row
• Options:
• Connection
• Target table
• Commit size
• Keys
• Update fields
• Do not perform any update
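As a rough SQL illustration of this merge logic (table and column names are hypothetical; MySQL syntax):

  -- Mimics the per-row lookup-then-insert-or-update behavior,
  -- relying on a unique key on customernumber
  INSERT INTO customer_dim (customernumber, city, creditlimit)
  VALUES (103, 'NYC', 50000)
  ON DUPLICATE KEY UPDATE
    city        = VALUES(city),
    creditlimit = VALUES(creditlimit);

Note that the step itself performs the lookup row by row and skips the update when the target fields already match; the SQL above leans on a unique key instead.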
Update and Delete
• Update
  • Similar to Insert/Update, but it only updates existing rows
• Delete
  • Rows that match the keys are deleted; this works like the filter in a WHERE clause
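For comparison, the SQL equivalents of the two steps, with hypothetical table and key names:

  -- Update step: updates only the rows matched on the key fields
  UPDATE customer_dim
     SET creditlimit = 60000
   WHERE customernumber = 103;

  -- Delete step: removes the rows matched on the key fields,
  -- exactly like a WHERE clause filter
  DELETE FROM orders_stg
   WHERE status = 'Cancelled';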
Exercise
Overview
• This exercise is designed to introduce you to various methods of interacting with data using Pentaho Data Integration.
• In this exercise, you create a database connection, explore a data source, and create transformations that use various data input and output steps.
Objectives:
• Create a database connection
• Use Database Explorer to interact with a data source
• Create a transformation that uses the Table input and Table output steps
• Create a transformation that uses the Text file output step
• Create a transformation that uses the CSV file input and Insert/Update steps
• Create a transformation that uses the Table input step that loads data based
on a parameter value
Data Warehousing Steps
Design the Target Database
• Ideally, the target database or data warehouse should be designed before one starts creating the mappings from source to target.
• Staging tables should ideally mirror the source tables to keep the extraction process simple.
• The OLAP schema needs to be designed based on the
reporting/analytical need of the end user.
Creating Source to Target Mappings
• Create a mapping document that specifies how each source column maps to its target column.
• This covers the transformation logic, exception handling, and data types and lengths.
• The document is useful for all development activities during the ETL build.
Creation of a Dimension Model
• Based on Ralph Kimball's principles, create the dimension model for the analytical requirements of the end/business user.
• Typically, there will be one Facts table and multiple Dimension tables
in the star schema.
• Dimension tables will hold the details of the context for analysis.
• The fact table holds the measures and references to the dimension keys. These are huge tables with millions of records.
• There can be aggregate/summary tables for faster reporting.
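A minimal star-schema sketch in SQL, with hypothetical names loosely based on the training data:

  -- Dimension table: holds the context for analysis
  CREATE TABLE dim_customer (
    customer_tk    INT PRIMARY KEY,   -- surrogate key
    customernumber INT,               -- business key
    customername   VARCHAR(50),
    country        VARCHAR(50)
  );

  -- Fact table: measures plus references to the dimension keys
  CREATE TABLE fact_orders (
    customer_tk     INT,              -- references dim_customer
    date_tk         INT,              -- references a date dimension
    product_tk      INT,              -- references a product dimension
    quantityordered INT,
    priceeach       DECIMAL(10,2)
  );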
Slowly Changing Dimensions
• Three types of SCD are most commonly used in the industry:
• Type 1 dimension:
  • New information overwrites old information
  • Old information is not saved; it is lost
  • Can only be used in applications where maintaining a history of the data is not essential - used for updates only
• Type 2 dimension:
  • New information is appended to old information
  • Old information is saved - it is effectively versioned
  • Can be used in applications where maintaining a history of the data is required, so changes in the data warehouse can be tracked
• Type 3 dimension:
  • New information is saved alongside old information
  • Old information is partially saved
  • Additional columns are created to show the time from which the new information takes effect
  • Enables viewing facts in both the current state and “what-if” states of dimensional values
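To make the Type 2 behavior concrete, a sketch of the usual "close the old version, append a new one" pattern in SQL (hypothetical dimension table; '9999-12-31' marks the open-ended current version):

  -- Close the current version of customer 103
  UPDATE dim_customer
     SET date_to = '2007-06-01'
   WHERE customernumber = 103
     AND date_to = '9999-12-31';

  -- Append the new version with an incremented version number
  INSERT INTO dim_customer
    (customernumber, city, version, date_from, date_to)
  VALUES
    (103, 'Boston', 2, '2007-06-01', '9999-12-31');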
Dimension Lookup/Update
• Implements slowly changing dimensions: Type 1 and Type 2.
• Can be used for updating a dimension table and for looking up values in
a dimension.
• Looks up the row; if it is not found, performs the insert/update.
• Each entry in the dimension table has the following fields:
• Technical key: Primary (surrogate) key of the dimension
• Version field: Shows the version of the dimension entry (a revision number)
• Start of date range: Field containing the valid start date
• End of date range: Field containing the valid end date
• Keys: Business keys used in source systems (such as customer number or product id) - used for lookup functionality
• Fields: Contain actual dimension information
• Can be set individually to update all versions
• Can also be set to add a new version when a new value appears
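Putting those fields together, a dimension table managed by this step might look like this sketch (all names illustrative):

  CREATE TABLE dim_customer (
    customer_tk    BIGINT PRIMARY KEY, -- technical (surrogate) key
    version        INT,                -- version of the dimension entry
    date_from      DATETIME,           -- start of the valid date range
    date_to        DATETIME,           -- end of the valid date range
    customernumber INT,                -- business key used for lookup
    customername   VARCHAR(50),        -- actual dimension information
    city           VARCHAR(50)
  );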
Combination Lookup/Update
• Also called a junk dimension
• Creates the Cartesian product for the degenerate dimension, where the data is stored as codes/flags in the fact table
Pentaho Training Data: Tables (1 of 3)
Database: pentaho_oltp
Offices
• 7 offices worldwide (San Francisco, Boston, NYC, Paris, Tokyo, Sydney, London)
• Headquartered in San Francisco, CA
• Each office assigned to a sales territory (APAC, NA, EMEA, or JAPAN)
Employees
• 23 employees: 6 executives and 17 sales representatives
• Each assigned to one of the seven offices
• Sales representatives are also assigned a particular number of customers (distributors)
• New sales representatives (still in training) do not have assigned customers
Pentaho Training Data: Tables (2 of 3)
Customers
• Steel Wheels has 122 customers worldwide
• Approximately 20 are new customers without a sales representative
• Each has a credit limit, which determines the maximum outstanding balance
Products
• 110 unique models purchased from 13 vendors
• Classified into 7 distinct product lines: Classic cars, vintage cars, motorcycles, trucks and buses, planes, ships, trains
• Models classified based on scale (1:18, 1:72, and so on)
• Cost paid and MSRP (manufacturer’s suggested retail price)
Payments
• Customers make payments on average 2-3 weeks after placing an order
• In some cases, one payment pays for more than one order
Pentaho Training Data: Tables (3 of 3)
Orders
• 2,560 orders spanning the period from 1/1/2000 to 12/31/2007
• Each in a given state: In process, shipped, cancelled, disputed, resolved, on hold
Orderdetails
• Order line items reflect the negotiated price and quantity per product
• The training database has 23,640 records in Orderdetails
Database Lookup
• Looks up attributes from a single table based on key-matching criteria
• Options:
  • Lookup table: Name of the table where the lookup is performed
  • Enable cache: Caches database lookups for the duration of the transformation
    • Enabling this can increase performance
    • Danger: If other processes can change values in the table, do not set this option
  • Load all data from table: Preloads the complete table into memory at the initialization phase
• Can replace a Stream lookup step in combination with a Table input step, and is faster
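Conceptually, the step behaves like a left join against the lookup table. A sketch in SQL, assuming the incoming stream carries customernumber and the lookup retrieves creditlimit from the training customers table (stream_rows stands in for the incoming stream):

  SELECT s.ordernumber,
         s.customernumber,
         c.creditlimit          -- the configured default is used when no match is found
    FROM stream_rows s
    LEFT JOIN customers c
      ON c.customernumber = s.customernumber;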
Stream Lookup
• Allows users to look up data using information from other steps in the transformation
• Data from the source step is first read into memory (cached) and then used to look up data for each record in the main stream
• Options:
  • Source step: Step used to obtain the in-memory lookup data
  • Key(s) to lookup value(s): Specify the names of the fields used to look up values
    • Values are always searched using the equals (=) operator
  • Fields to retrieve: Specify the names of the fields to retrieve, the default in case a value is not found, or a new field name if the output stream field name should change
Merge Join
• The Merge join step performs a merge join between data sets, using data from two different input steps.
• Options:
• Step name: Unique name of step
• First Step: First input step to the merge join
• Second Step: Second input step to the merge join
• Join Type: INNER, LEFT OUTER, RIGHT OUTER, or FULL OUTER
• Keys for 1st step: Key fields on which incoming data is sorted
• Keys for 2nd step: Key fields on which incoming data is sorted
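In SQL terms, a LEFT OUTER merge join of the training customers and orders tables would look like the sketch below; in PDI, both inputs must already be sorted on the key fields (for example, via Sort rows steps or an ORDER BY in the Table input):

  SELECT c.customernumber,
         c.customername,
         o.ordernumber,
         o.status
    FROM customers c
    LEFT OUTER JOIN orders o
      ON o.customernumber = c.customernumber
   ORDER BY c.customernumber;  -- both inputs sorted on the join key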
Database Join
• Options:
• SQL: The database SQL query
• Number of rows to return: 0 means all; any other number limits the number of rows
• Outer join?: When checked, always returns a single record for each input stream record, even if the query did not return a result
• The Database join step allows parameters in the query:
  • Parameters are noted as ‘?’
  • The order of fields in the parameter list must match the order of the ‘?’ marks in the query
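For example, a query for this step might look like the following sketch; the two '?' placeholders are filled, in order, from the first and second fields in the parameter list:

  SELECT ordernumber,
         orderdate,
         status
    FROM orders
   WHERE customernumber = ?    -- first parameter field
     AND orderdate >= ?        -- second parameter field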
Select Values
• Used for:
• Select/remove fields from process stream
• Rename fields
• Specify/change the length and/or precision of fields
• Options:
• Select and alter: Specify exact order and field names used for output rows
• Remove: Fields removed from output rows
• Meta-data: Changes to name, type, length and precision (metadata) of field(s)
Calculator
• Provides a list of functions executed on field values
• Advantage: The execution speed of the Calculator is many times that of JavaScript
• Besides the arguments (Field A, Field B, and Field C), you must also specify the return type of the function
Filter Rows
• Filters rows based upon conditions and comparisons (full boolean logic supported).
• Output can be diverted into two streams: records that meet the condition (true) and records that do not (false).
  • Identify exceptions that must be written to a bad file
  • Branch transformation logic if a single source has two interpretations
• Options:
• Send ‘true’ data to step: Step that receives rows when the condition is true
• Send ‘false’ data to step: Step that receives rows when the condition is false
Sort Rows
• Sort rows based on specified fields (including sub-sorts), in ascending or descending order.
• Options:
• List of fields and whether they should be sorted
• Sort directory: Directory in which temporary files are stored when needed
• Default is the system temporary directory
• Sort size: More rows stored in memory yields faster sorts
• Eliminates need for temp files (reducing costly disk input/output)
• The TMP-file prefix: Prefix used to identify files in temp directory
Merge Rows
• Compares and merges two streams of data: a reference stream and a compare stream
• Mostly used to identify deltas in source data when no timestamp is available
  • Reference stream = previously loaded data
  • Compare stream = newly extracted source data
Parameters and Variables
• Parameters:
  • Parameters are user-defined inputs to a transformation/job that can be used throughout the ktr/kjb. A parameter can also have a default value to use in the event that one is not provided.
• Variables:
  • A variable in PDI can be used to store values, dynamically and programmatically, and can be accessed based on the scope it is defined for.
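As a sketch, with hypothetical entries: variables defined in kettle.properties (in the user's .kettle directory) are available to every job and transformation, and most step and connection fields accept the ${} syntax.

  # kettle.properties
  DB_HOST=localhost
  DB_NAME=pentaho_oltp

  # referenced in a connection or step field as:
  # jdbc:mysql://${DB_HOST}:3306/${DB_NAME}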
Exchanging data between Transformations
• Data can be exchanged between transformations within a job.
• This is done in memory.
• In the transformations, the following steps are used:
  • Copy rows to result
  • Get rows from result
• This is useful for segregating various functionalities (transformations).
• In the sub-job, uncheck “Clear list of result rows before execution”.
Command-Line Parameters
A job can be called from a .bat /.sh file with parameters.
For example:
kitchen.bat /file:"directory\parajob.kjb" test1 test2
Parameters can be transferred to subsequent transformations.
Attention: “Copy previous results to args” MUST NOT be checked, otherwise the
parameters are not transferred to the transformation.
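Named parameters can also be passed explicitly on the command line; a sketch, assuming the job defines a START_DATE parameter:

  kitchen.bat /file:"directory\parajob.kjb" "/param:START_DATE=2007-01-01"

On Linux, the equivalent is kitchen.sh with -file= and -param:START_DATE=2007-01-01.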
Running Job Entries in Parallel
By default, job entries only run sequentially, even when they are drawn in parallel in the job designer; a job entry's "Launch next entries in parallel" option enables parallel execution.
Logging and Monitoring
• Logging is tightly linked to monitoring and scheduling. You should know whether scheduled jobs run successfully, how much time they require, and so on.
• Headless operation
  • Most ETL in production is not run from the user interface
  • Need a location to view job results
• Performance monitoring
  • Useful information for current performance problems and capacity planning
Types of Logging
• File based
• Verbose
• Conventional “Log”
• Database logging
• RDBMS based
• Structured
• Summarized
Log Levels
• There are various log levels based on requirement
• Error
• Nothing
• Minimal
• Basic
• Detailed
• Debug
• Row Level
Scheduling Jobs/Transformations
• Scheduling is an option that is present in the DI server.
• Other options to schedule Jobs/Transformations
  • cron/crontab on Unix/Linux
  • Windows Task Scheduler
• They can also be scheduled by calling them from the BI server (xAction)
Pre and Post Processing
• Pre Processing
• Drop index before bulk load
• Pull the files from a remote FTP server
• Unzip all the zipped files
• Create target table structures
• Post Processing
• Re-create index
• Delete files from remote FTP server
• Clear Temp tables
• Send completion Emails
• Update Summary/control tables
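Where the pre/post steps involve indexes, they often reduce to simple SQL run from a job's SQL entry; a sketch with hypothetical index and table names:

  -- Pre-processing: drop the index before the bulk load
  DROP INDEX idx_fact_orders_customer ON fact_orders;

  -- ... bulk load runs here ...

  -- Post-processing: re-create the index afterwards
  CREATE INDEX idx_fact_orders_customer ON fact_orders (customer_tk);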
Tuning
• Performance Tuning Steps
• Select Values – Avoid using it just to remove fields from the flow
• Get Variables – Do not use it at the beginning of a high-volume stream. Instead, use it in a separate stream and join using “Join Rows”.
• Lazy Conversion – Using this can improve performance in certain cases.
• Text File Input – Avoid using it for CSV/fixed-width files; use the specific steps instead.
• JavaScript – Avoid using it unless necessary.
• In general, avoid Sort and Group By steps unless necessary.
• Carte:
• A simple web server that helps execute transformations/jobs remotely.
• Can be started by executing the following command from the DI folder:
  carte.sh/carte.bat <hostname> <port>
Clustering continued…
• How to configure a cluster:
• Cluster nodes need to be defined
• Define Cluster schemas based on the existing available nodes
• Example: if only two CPU cores are under heavy load and the rest are idle, multiple copies of the Group By step can be started to leverage the idle cores.
Row Denormaliser
Options include:
• Key field: Field that defines the key (Metric Type in the example)
• Group fields: Fields that make up the grouping (Weekdate in the example)
• Target fields: Fields to denormalize - specify the string value for the key field (Quantity Miles, Loaded_Miles, Empty_Miles in the example)
• Options are provided to convert data types
  • Most designs use strings to store values - helpful if the value is actually a number or date
• If there are key-value pair collisions (the key is not unique for the group specified), specify the aggregation method used to compute the new value
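Reconstructing the example described above with illustrative values (the incoming value field is assumed to be called Value):

  Input (one row per metric):
  Weekdate    Metric Type     Value
  2007-01-07  Quantity Miles  120
  2007-01-07  Loaded_Miles    90
  2007-01-07  Empty_Miles     30

  Output (one row per Weekdate group):
  Weekdate    Quantity Miles  Loaded_Miles  Empty_Miles
  2007-01-07  120             90            30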
Row Flattener
• Flattens sequentially provided rows
• Usage notes:
  • Rows must be sorted in the proper order
  • Use the Row Denormaliser if key-value pair intelligence is required for flattening
• Example input:
  Field1  Field2  Field3  Flatten
  A       B       C       One
  A       B       C       Two
  D       E       F       Three
  D       E       F       Four
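Assuming two target fields (named Target1 and Target2 here for illustration), the flattened output of the example would be:

  Field1  Field2  Field3  Target1  Target2
  A       B       C       One      Two
  D       E       F       Three    Four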
ETL Patterns