
Pentaho Data Integration
Training Prerequisites
• Audience
• Technical users who build/maintain data models for analysis, and manage BI
data/metadata from various data sources. These users can be:
• Database Developers, Business Analysts, BI Architects and Systems Integrators

• Technical
• Knowledge of SQL and relational database concepts

• System
• Any machine running Windows or Linux with at least 4 GB RAM and installation rights
• Sun Java 1.7.x JDK/JRE
• MySQL 5.x or above
What will we learn in this training?
• Basic architecture of Pentaho Data Integration (PDI)
• Interaction with different data sources
• Integrating various data sources
• Work with various transformations (steps)
• Performance Tuning (DB/PDI)
• Jobs and Transformations in detail
• Variables, Parameters and Arguments
• Building flexible Jobs and Transformations
• Logging, Monitoring and Error handling in PDI
• Use of Javascript and Java classes
Training Methodology
• 8 hours a day + 1 hour Lunch
• Couple of small breaks as and when needed 
• Hands-on training with many exercises
• A small project at the end of the sessions
• Feedback from the participants at the end of the training
Pentaho Data Integration Overview
Introduction
• PDI is the product associated with the KETTLE
open source project:
• KETTLE is open source software that makes
up the core of PDI Enterprise Edition
• PDI Enterprise Edition is production ready
• Professional technical support
• Maintenance releases
• EE-only features including enterprise
security, scheduling, monitoring and
more
• Documentation
• PDI is a member of the Pentaho BI Suite.
Pentaho Data Integration Features
• Ease of use:
• 100% metadata driven (define WHAT you want to do, not HOW to do it)
• No extra code generation means lower complexity
• Simple setup, intuitive graphical designers, and easy to maintain

• Flexibility:
• Never forces a certain path on the user
• Pluggable architecture for extending functionality

• Modern standards-based architecture:
• 100% Java with broad, cross-platform support
• Over 100 out-of-the-box mapping objects (steps and job entries)
• Enterprise-class performance and scalability

• Lower total cost of ownership (TCO):
• No license fees
• Short implementation cycles
• Reduced maintenance costs
Common Uses
• Data warehouse population
• Export of database(s) to text-file(s) or other databases
• Import of data into databases, ranging from text files to Excel spreadsheets
• Data migration
• Information enrichment by looking up data
• Data cleansing
Where PDI Fits

[Diagram: structured and unstructured data and metadata flow through the PDI layer into data storage, feeding dashboards, reports, and Analyzer.]
AGILE BI
• Modeling and visualization perspectives let you analyze the data from within PDI:
• Model perspective
• Visualization perspective
The DI Repository
PDI can store metadata in:
• XML files
• RDBMS repository
• Enterprise repository
Objects stored in repository:
• Connections
• Transformations
• Jobs
• Schemas
• User and profile definitions

Repository supports collaborative development


DI Repository
• Enterprise Repository and Content Management
• Repository based on JCR (Content Repository API for Java)
• Repository browser
• Full revision history, allowing you to compare and restore previous
revisions of a job or transformation
• Ability to lock transformations/jobs for editing
• ‘Recycle bin’ concept for working with deleted files
• Enterprise security
• Configurable authentication including support for LDAP and MSAD
• Task permissions to control what actions a user/role can perform
• Scheduling
• The Jobs/Transformations can be scheduled from within the repository.
Pentaho Data Integration Components
• Spoon
• Graphical environment for modeling
• Transformations are metadata models describing the flow of data
• Jobs are workflow-like models for coordinating resources, execution, and
dependencies of ETL activities
• Pan
• Command line tool for executing transformations modeled in Spoon

• Kitchen
• Command line tool for executing jobs modeled in Spoon
• Carte
• Lightweight web/HTTP server for remotely executing jobs and transformations
• Carte accepts XML containing the transformation to execute and the configuration
• Enables remote monitoring
• Used for executing transformations in a cluster
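As a hedged sketch (the file paths and log level are placeholder examples), Pan and Kitchen are typically invoked from the command line like this:

  # Execute a transformation with Pan (path and options are examples)
  ./pan.sh -file=/opt/etl/load_customers.ktr -level=Basic

  # Execute a job with Kitchen
  ./kitchen.sh -file=/opt/etl/nightly_load.kjb -level=Basic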
Transformations

• A network of logical steps that takes care of the data flow
• Deals with data
• Connects to various data sources
• Data massaging and cleansing is done here
• The smallest component that can be executed independently
Hops - Within Transformations
• Can be considered connectors between the steps in a transformation
• Metadata flows along them
• Help understand the flow of data (source to target)
• Can be used to copy or distribute the data to multiple target steps
• Error handling can be defined with these
• Also used to connect info steps
Data Flow and Threading Mechanism
• All the steps in a transformation run in parallel; the execution sequence cannot be determined.
• PDI internally manages the data flow between the steps along with the metadata flow.
• PDI processes the data in batches.
• Any number of data rows can be processed.
• Processing is done on the stream of data.
• A source step will not pull more data until the next step is able to accommodate another batch of rows.
Data, Metadata and Values
• A data row is composed of metadata and data.
• The metadata travels with the first row; subsequent rows reference it.
• Values are the columns in the data rows.
• PDI data types are mapped to the database (JDBC) data types.
Data, Metadata and Values
• Metadata is used for formatting during preview and when writing data to the target data source.
• Metadata is not used when there is a one-to-one mapping without any changes, e.g. a table-to-table data load.
• Metadata is used for creating SQL statements (DDL) for database output steps.
• Metadata is used to verify that the data types are correct.
• A change in metadata will not change the data:
• Modification of length does not change/truncate data
• New formatting does not change data
Available Data Types
• Number (double, floating point)
• Integer
• Big Number
• String
• Date – Includes date and time: yyyy/MM/dd HH:mm:ss.SSS
• Boolean – True/False
• Binary – Mainly used for Blob data.
• Serializable – An object to transfer from/to specific steps
Lazy Conversion
• A few steps provide an option to delay data type conversion for optimal performance.
• Conversion is done only when it is necessary.
• It is typically done at the output steps.
• Conversion is skipped when the source and target are text files or the source format is the same as the target format.
• Steps that support lazy conversion are CSV File Input, Fixed File Input, and Table Input.
The UI Interface
• Main Tree
• Lists all open transformations and jobs and their contents
• Core objects and favorite steps/job entries
• Favorites is a static list of the most-used steps
• Core Objects is the toolbox with all the available steps/job entries
• Notes
• Can be placed anywhere in graphical view by right click and select “Add Note”.
• Options and settings
• Options are valid for the entire PDI environment
• Settings are valid for a particular transformation or Job
Run and Preview
• You can execute entire transformation or just preview a particular step
• Preview also possible by selecting a step and click preview button
• Need at least two steps connected with a hop to run or preview
• Closing a preview (versus choosing the stop/get more rows options) will
leave the transformation in paused execution state
• Be cautious, as this may cause problems: during a preview all the steps are initialized at once and the data rows are passed on to the subsequent steps as well.
• You should disable the downstream hops in that case.
Log View
• Helps understand the statistics associated to the execution of a
transformation.
• The log level can be set, down to row level.
• Any step with errors is highlighted in the Step Metrics tab.
• The Logging tab provides the details of the error, and the error lines are shown in red.
Debugging
• Allows you to set arbitrary break points in a transformation.
• You can either preview the rows by selecting the "Retrieve first rows" option or pause the transformation on a condition.
Files to be considered
• There are user-specific files that need to be understood:
• kettle.properties : default properties file for storing variables
• shared.xml : default shared objects file
• db.cache : the database cache for metadata
• repositories.xml : the local repositories file
• .spoonrc : User interface settings, last opened transformation/job
• .languagechoice : User language (delete to revert language)

• All these files normally reside in the ~/.kettle folder
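As an illustration (the variable name and path are hypothetical), a global variable can be added to kettle.properties from a shell and is then available to every transformation and job run by that user:

  # Append an example variable to the user's kettle.properties
  echo "ETL_INPUT_DIR=/data/incoming" >> ~/.kettle/kettle.properties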


KETTLE_HOME
• This is the HOME directory for PDI and may change based on the logged-in user.
• It is set just before any component of PDI is started/invoked.
• It is the directory that contains the .kettle folder.
• It can be configured to be the same for all users, e.g. /pentaho/Kettle/common.
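A minimal sketch, assuming the example path above, of pointing every user at the same configuration before starting a PDI component:

  # KETTLE_HOME is the directory that contains the .kettle folder
  export KETTLE_HOME=/pentaho/Kettle/common
  # PDI now reads /pentaho/Kettle/common/.kettle/kettle.properties
  ./spoon.sh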
Connections, Inputs and Outputs
Database Connections
Multiple connections can be created, one for each data source.
With a PDI repository:
• Defined connections are readily available to transformations and jobs.
• Connection information for the repository is stored in repositories.xml.
Without a PDI repository:
• Connection definition contained in a single transformation or job.
• The connection can be shared with subsequent transformations and jobs.

The connections appear in the Explorer view.

Shared connections are shown in bold.


Access via JDBC
• Drivers can be added to the /data-integration/lib directory.

• Use Generic tab of connection dialog to use unlisted drivers

• Permits connections to non-listed databases

• Existing drivers can be replaced in /data-integration/lib directory.
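For example (the driver file name and install path are assumptions), adding a JDBC driver is just a copy into the lib directory, followed by a restart of Spoon/Carte:

  # Copy the vendor's JDBC driver jar into PDI's lib directory
  cp mysql-connector-java-5.1.49.jar /opt/pentaho/data-integration/lib/
  # Restart Spoon (or Carte/DI server) so the driver is picked up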


Other Access Methods
• ODBC connections are possible
• ODBC connections must be defined in Windows.
• ODBC connections made via ODBC-JDBC-Bridge.
• Some limitations on SQL syntax
• Generally slower than JDBC due to additional layer
• Use a JNDI connection to connect to a data source defined in an application
server like JBoss or WebSphere.
• Plugin specific access methods are supplied by a specific database driver (like
SAP R/3 or PALO connections).
DB Cache: The Metadata Cache
• Metadata for fields in each connection and SQL statement is cached in db.cache
• Refreshed automatically when the table is changed in the PDI context (from the SQL statement window)
• It should be manually refreshed if you see a mismatch in the "Show Input/Output fields" option
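If the cache gets out of sync, it can also be cleared from the file system while Spoon is closed (a sketch; db.cache lives in the .kettle folder, as noted earlier):

  # Remove the metadata cache; it is rebuilt on the next connection
  rm ~/.kettle/db.cache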
SQL Editor
• Creates the needed DDL for output steps related to database tables
• There is an SQL button in the output step that creates the DDL statements
• The table layout does not change automatically with changes in the metadata of the output step; the DDL in the SQL editor can be modified manually
Transformations
Text File Input
• Reads a large number of different text files, including CSV files
generated by spreadsheet applications.
• Options:
• Filename
• Accept Filename from previous step
• Content Specification
• Error Handling
• Filters
• Fields
• Formats
Replaying a Transformation
• Implemented for Text File Input and Excel Input
• Allows files containing errors to be sent back to the source and corrected
• Uses a .line file to reprocess the file
• ONLY lines that failed are processed during replay
• Uses the date in the filename of the .line file to match the replay date
CSV File Input
• Reads CSV files only.

• Due to the internal processing, this step is much faster than text
file input.

• Input options are a subset of Text File Input.


• NIO buffer size: Sets the buffer size used by the Java NIO classes (improves performance in the area of buffer management)
• Lazy conversion: The step supports lazy conversion
Generate Rows
• Outputs certain number of rows
• Default is empty but can contain a set of static fields
• Output Options:
• Limit: The number of rows to output
• Fields: Static fields user includes in output row
Fixed File Input
• Reads only fixed file formats.
• Due to the internal processing, this step is much faster than text file
input.
• Input options are a subset of Text File Input.
• NIO buffer size and lazy conversion options identical to CSV file input
• Run in parallel: Must be checked if transformation is executed in cluster with
many workers processing a large file.
Table Input
• Reads information from a database using a connection and SQL
statement(s).
• Options:
• Connection : The source DB connection
• SQL: Any query to pull data using the Connection
• Insert Data from step: The input step name where parameters for the SQL
originate, if appropriate
• Limit: Number of lines to return
Get System Info
• Gets information about the PDI environment
• Options:
• Fields: Output Fields
• System information types:
• Date and time information
• Run-time transformation metadata
• Command-line arguments
Microsoft Excel Input/Output
• Input Options:
• Sheet Tab
• Fields Tab
• Error Handling Tab
• Content Tab

• Export Option:
• Sheet Name
• Protect Sheet with Password
• Use a template
• Append the content of the template
Text File Output
• Can write output in many formats, including fixed-width, CSV, etc.
• Options:
 Extension
 Append
 Separator
 Enclosure
 Header/footer
 Zipped
 Include step number/date/time in file name
 Encoding
 Right pad fields
 Split every or ‘n’ row(s)
 Write directly to servlet output when run on Carte or DI server
Table Output
• Inserts the information in the database table
• Options:
• Target table
• Commit size
• Truncate table
• Ignore insert errors
• Partition data over tables
• Use batch update for inserts
• Return auto-generated key
• Name of auto-generated key field
• Is the name of the table defined in a field?
Insert/Update
• Automates simple merge processing:
• Look up a row using one or more lookup keys.
• If a row is not found, insert a new row.
• If the row is found: do nothing if the target fields are the same, otherwise update the row.
• Options:
• Connection
• Target table
• Commit size
• Keys
• Update fields
• Do not perform any update
Update and Delete
• Update
• This is similar to Insert/Update but it only Updates

• Delete
• Rows that match the keys are deleted; the keys act like the filter in a WHERE clause.
Exercise
Overview
•This exercise is designed to introduce you to various methods of interacting with
data using Pentaho Data Integration.
•In this exercise, you create a database connection, explore a data source, and
create transformations that use various data input and output steps.
Objectives:
• Create a database connection
• Use Database Explorer to interact with a data source
• Create a transformation that uses the Table input and Table output steps
• Create a transformation that uses the Text file output step
• Create a transformation that uses the CSV file input and Insert/Update steps
• Create a transformation that uses the Table input step that loads data based
on a parameter value
Data Warehousing Steps
Design the Target Database
• In the best case scenario, the target database or the Data Warehouse
should be designed before one starts creating the mappings from
source to target.
• Staging tables should ideally have the same structure as the source to keep the extraction process simple.
• The OLAP schema needs to be designed based on the
reporting/analytical need of the end user.
Creating Source to Target Mappings
• Create a mapping document to determine how each source column maps to the target column.
• This will involve all the transformation logic, exception handling, data
type and length.
• This document is useful for all the development activities during the
ETL development.
Creation of a Dimension Model
• Based on Ralph Kimball’s principle, create the dimension model for
the analytical requirements of the end/business user.
• Typically, there will be one fact table and multiple dimension tables in the star schema.
• Dimension tables hold the details of the context for analysis.
• The fact table holds the measures and the references to the dimension keys. These are huge tables with millions of records.
• There can be aggregate/summary tables for faster reporting.
Slowly Changing Dimensions
• Three types of SCDs are most commonly used in the industry:
• Type 1 dimension:
• New information overwrites old information
• Old information is not saved; it is lost
• Can only be used in applications where maintaining a chronicle of data is not essential - used for update only
• Type 2 dimension:
• New information is appended to old information
• Old information is saved - it is effectively versioned
• Can be used in applications where maintaining a chronicle of data is required, so changes in the data warehouse can be tracked
• Type 3 dimension:
• New information is saved alongside old information
• Old information is partially saved
• Additional columns are created to show the time from which the new information takes effect
• Enables a view of facts in both current state and "what-if" states of dimensional values
Dimension Lookup/Update
• Implements slowly changing dimensions: Type 1 and Type 2.
• Can be used for updating a dimension table and for looking up values in
a dimension.
• Lookup, if not found, then update/insert.
• Each entry in the dimension table has the following fields:
• Technical key: Primary (surrogate) key of the dimension
• Version field: Shows the version of the dimension entry (a revision number)
• Start of date range: The field name containing valid starting date
• End of date range: The field name containing valid ending date
• Keys: Business keys used in source systems (such as customer number, product id) - used for lookup functionality
• Fields: Contain actual dimension information
• Can be set individually to update all versions
• Can also be set to add a new version when a new value appears
Combination Lookup/Update
• This is also called a junk dimension
• Creates the Cartesian product for degenerate dimensions whose data is stored as codes/bits in the fact table

• Example: the degenerate dimensions Gender (M/F/U) and Order Type (New/Return) combine into one combination dimension (ID/Gender/Order Type): 1/M/New, 2/F/New, 3/U/New, 4/M/Return, ...
Introduction to Training Data
Training Data
Represents a fictitious company: Steel Wheels
 Buys collectable model cars, trains, trucks, and so on from manufacturers
 Sells to distributors across the globe
Data adapted from sample data provided by the Eclipse BIRT project
The pentaho_oltp database has many tables:
 Offices, Employees, Customers, Products, Orders, Orderdetails, Payments, and so on
Training Data: Tables
Offices
 7 offices worldwide (San Francisco, Boston, NYC, Paris, Tokyo, Sydney, London)
 Headquartered in San Francisco, CA
 Each office assigned to a sales territory (APAC, NA, EMEA, or JAPAN)
Employees
 23 employees: 6 executives and 17 sales representatives
 Each assigned to one of seven offices
 Sales representatives also assigned to particular number of
customers (distributors)
 New sales representatives (still in training) do not have
assigned customers
Pentaho Training Data: Tables (2 of 3)
Customers
 Steel Wheels has 122 customers worldwide
 Approximately 20 are new customers without a sales representative
 Each has a credit limit which determines maximum outstanding
balance
Products
 110 unique models purchased from 13 vendors
 Classified as 7 distinct product lines: Classic cars, vintage cars, motorcycles, trucks and buses, planes, ships, trains
 Models classified based on scale (1:18, 1:72, and so on)
 Cost paid and MSRP (manufacturer's suggested retail price)
Payments
 Customers make payments on average 2-3 weeks after placing order
 In some cases, one payment pays more than 1 order
Pentaho Training Data: Tables (3 of 3)
Orders
 2,560 orders spanning period from 1/1/2000 to 12/31/2007
 Each in a given state: In process, shipped, cancelled, disputed, resolved, on hold
Orderdetails
 Order line items reflect negotiated price and quantity per product
 Training database has 23,640 records in Orderdetails
Database Lookup
•Lookup attributes from a single table based on a key-matching criteria
•Options:
• Lookup table: Name of the table where the lookup is performed
• Enable cache: Caches database lookups for the duration of the transformation; enabling this can increase performance
• Danger: If other processes can change values in the table, do not set this option
• Load all data from table: Preload the complete data in memory at initialization
phase
•Can replace a Stream lookup step in combination with a Table input
step and is faster
Stream Lookup
•Allows users to lookup data using information from other steps in
transformation
•Data from source step is first read into memory (cached) and then used to
look up data for each record in main stream
•Options:
• Source step: Step used to obtain the in-memory lookup data
• Key(s) to lookup value(s): Specify names of fields used to
lookup values
• Values always searched using the equals (=) operator
• Fields to retrieve: Specify names of fields to retrieve, default in
case value not found, or new field name if output stream field
name should change
Merge Join
•Merge join step performs a merge join between data sets using data
from two different input steps.
• Options:
• Step name: Unique name of step
• First Step: First input step to the merge join
• Second Step: Second input step to the merge join
• Join Type: INNER, LEFT OUTER, RIGHT OUTER, or FULL OUTER
• Keys for 1st step: Key fields on which incoming data is sorted
• Keys for 2nd step: Key fields on which incoming data is sorted
Database Join
• Options:
• SQL: The database SQL query
• Number of rows to return: 0 means all, any other number limits number of
rows
• Outer join?: When checked, always return a single record for each input
stream record, even if query did not return a result
•Database join step allows parameters in the query
• Parameters noted as ‘?’
• Order of fields in parameter list must match the order of the ‘?’ in the query
Select Values
• Used for:
• Select/remove fields from process stream
• Rename fields
• Specify/change the length and/or precision of fields
• Options:
• Select and alter: Specify exact order and field names used for output rows
• Remove: Fields removed from output rows
• Meta-data: Changes to name, type, length and precision (metadata) of field(s)
Calculator
• Provides list of functions executed on field values
• Advantage: Execution speed of calculator is many times that of
JavaScript
• Besides arguments (Field A, Field B and Field C), must also specify the
return type of the function
Filter Rows
• Filter rows based upon conditions and comparisons (full boolean logic
supported).
• Output can be diverted into 2 streams: Records that meet the
condition (true) and records that do not (false).
• Identify exceptions that must be written to a bad file
• Branch transformation logic if single source has two interpretations
• Options:
• Send ‘true’ data to step: Step that receives rows when the condition is true
• Send ‘false’ data to step: Step that receives rows when the condition is false
Sort Rows
•Sort rows based on specified fields (including sub-sorts), in ascending
or descending order.
•Options:
• List of fields and whether they should be sorted
• Sort directory: Directory in which temporary files are stored when needed
• Default is the system temporary directory
• Sort size: More rows stored in memory yields faster sorts
• Eliminates need for temp files (reducing costly disk input/output)
• The TMP-file prefix: Prefix used to identify files in temp directory
Merge Rows
• Compares and merges two streams of data: Reference stream and compare
stream
•Mostly used to identify deltas in source data when no timestamp is available
• Reference stream = Previously loaded data
• Compare stream = Newly extracted source data

Usage note: Ensure streams are sorted by comparison key fields


•The output row is marked as follows:
• Identical: Key found in both streams and compared values were identical
• Changed: Key found in both streams but one or more values is different
• New: Key not found in the reference stream
• Deleted: Key not found in the compare stream
Group by
• Calculates aggregated values over a defined group of fields.
• Operates much like the ‘group by’ clause in SQL.
• Options:
• Aggregates: Specify the fields that need to be aggregated, the method (SUM, MIN, MAX) and
the name of the new result field
• Include all rows: If checked, output includes new aggregate records and original detail records
• Must also specify name of output field created and a flag to indicate whether row is aggregate or detail
record

• Usage note: The "Concatenate strings separated by" aggregate function can be used to create a list of keys: "117, 131, 145"
• Needs sorted input
• Another option: Use the Memory group by step, which handles unsorted input
Exercise
Tasks
1. Explain the purpose of the Value Mapper step
2. Use the Add constants step
3. Use the Filter step
4. Define the functionality of the Stream lookup step
5. Create a Pentaho Data Integration transformation that calculates the elapsed time between
customer orders
6. Map the structure of an online transaction processing database to the structure of an online
analytical processing database
7. Create a transformation that handles slowly changing dimensions in star schemas
8. Create a transformation that handles junk dimensions in star schemas
9. Use the Dimension lookup/update step in a transformation
10. Create a dimension table that stores territory information
11. Create a transformation that maps country data in the data store to territory data in the
dimension table
Introduction to Jobs
Jobs
• This is the component where workflow is managed in PDI.
• All the tasks that comprise a process are placed in one place.
• All the tasks/job-entries are connected with hops.
• The flow can be managed by the result of the previous job-entry.
• Unconditional
• Success
• Failure
• It also determines in what order the job-entries will be processed.
• The first entry should always be the START entry.
Job Entries
• The Job Entries perform various tasks:
• File Management
• Send Mails
• Execute Transformation/Sub-Job
• Check conditions
• File Transfer
• Scripting
Parameters, Arguments and Variables
• Arguments:
• In PDI, argument is a user-supplied input given while executing the Transformation
and Job on the command line. This can be used within the Transformation/Job.

• Parameters:
• Parameters are also user-defined inputs to a transformation/job that can be used throughout the ktr/kjb. They can also have a default value to use in the event that one is not provided.

• Variables:
• A variable in PDI stores a value, set dynamically or programmatically, which can be accessed according to the scope it is defined for.
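A hedged command-line sketch of the difference (job path, parameter name, and values are placeholders): named parameters are passed with -param and referenced as variables, while plain trailing values arrive as positional arguments:

  # Named parameter, available inside the job as ${START_DATE}
  ./kitchen.sh -file=/opt/etl/daily.kjb -param:START_DATE=2024-01-01

  # Positional arguments, available inside the job as argument 1, 2, ...
  ./kitchen.sh -file=/opt/etl/daily.kjb 2024-01-01 FULL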
Exchanging data between Transformations
• Data can be exchanged between the transformations within a job.
• This is done in memory.
• In transformation, the following steps are used:
• Copy rows to result
• Get rows from result
• This is useful to segregate various functionalities (Transformations).
• In sub-job, uncheck the “Clear list of result rows before execution”.
Command-Line Parameters
A job can be called from a .bat /.sh file with parameters.
For example:
 kitchen.bat /file:”directory\parajob.kjb” test1 test2
Parameters can be transferred to subsequent transformations.
Attention: “Copy previous results to args” MUST NOT be checked, otherwise the
parameters are not transferred to the transformation.
Running Job Entries in Parallel
By default, job entries only run sequentially.
 Even when the job is designed with two branches, the transformations do not run in parallel; the sequence depends on the order of creation.
Additional Job Topics
• If you want to execute the Job entries in parallel, right click the job
entries and select “Run Next Entries in Parallel”.
• Most job entries return either true or false. The JavaScript job entry only evaluates a condition, unlike the JavaScript step in a transformation.
• Mail can be sent as an Error handling option. Here you can attach the
error log as well.
Portable Transformations and Jobs
• This means that the ktr/kjb can run in different environments/settings by changing only some configuration files, without touching the code.
• Scenarios where we use this:
• Running the kjb/ktr in Dev/UAT/Prod environment
• Change the DB connection details based on the environment
• Processing similar type of files by passing different value to the parameter.
• Done by using variables in as many places as possible, with as little hard coding as possible.
• Variables can be defined in kettle.properties or set variable in a ktr
that can be used across all the ktrs/kjbs.
• Also done by passing Arguments at runtime or using named
Parameter.
Error Handling in Transformations
• Many steps in PDI support error handling
• If a step does, the "Error Handling" option is enabled in its context menu (right-click)
• Then check the checkbox and provide field names for the options
Logging
•Logs contain summary information about job or transformation execution.
• Number of records inserted
• Total elapsed time spent in transformation

•Logs also contain detailed information about job, transformation execution.


• Exceptions
• Errors
• Debugging information

•Logging is tightly linked to monitoring and scheduling. You should know if scheduled jobs run
successfully, how much time they require, and so on.

•The link between logging and monitoring/scheduling is discussed in other modules (Scheduling and Monitoring, and Operations Patterns).
Why we need Logging
•Reliability
• Job errors
• Review errors encountered

•Headless operation
• Most ETL in production is not run from user interface
• Need location to view job results

•Performance monitoring
• Useful information for current performance problems and capacity planning
Types of Logging
• File based
• Verbose
• Conventional “Log”

• Database logging
• RDBMS based
• Structured
• Summarized
Log Levels
• There are various log levels based on requirement
• Error
• Nothing
• Minimal
• Basic
• Detailed
• Debug
• Row Level
Scheduling Jobs/Transformations
• Scheduling is an option that is present in the DI server.
• Other options to schedule Jobs/Transformations
• Crontab in unix/linux
• Windows Task Scheduler
• They can also be scheduled by calling them from the BI Server (xAction)
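As an illustration (paths, schedule, and log location are assumptions), a Linux crontab entry that runs a job with Kitchen every night:

  # Run the nightly job at 01:30 and append the output to a log file
  30 1 * * * /opt/pentaho/data-integration/kitchen.sh -file=/opt/etl/nightly_load.kjb -level=Basic >> /var/log/etl/nightly.log 2>&1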
Pre and Post Processing
• Pre Processing
• Drop index before bulk load
• Pull the files from a remote FTP server
• Unzip all the zipped files
• Create target table structures

• Post Processing
• Re-create index
• Delete files from remote FTP server
• Clear Temp tables
• Send completion Emails
• Update Summary/control tables
Tuning
• Performance Tuning Steps
• Select Values – Avoid removing the fields from the flow
• Get Variables – Do not use it at the beginning for a high volume stream.
Instead, use it in a separate stream and join using “Join Rows”.
• Lazy Conversion – Using this can improve the performance in certain cases.
• Text File Input – Avoid using it for CSV/Fixed files. Use specific steps instead.
• JavaScript – Avoid using this unless necessary.
• In general, avoid using sort and group by steps unless necessary.

• More on Performance Tuning:
• http://wiki.pentaho.com/display/COM/PDI+Performance+tuning+check-list#PDIPerformancetuningcheck-list-Managethreadpriorities
Administration
• Administration
• Setup and Installation
• Logging
• Setting environment variables
• Creating folder structure for file management (temp/archive/deleted/zipped)
• Scheduling
• Monitoring
Interpreting Runtime Data
•Columns:
• Stepname: Name of the step (lookup_region, read_orders)
• Copynr: Copy number if multiple copies are started (0,1,2,3,4, ...)
• Read: Number of records received from PREVIOUS step
• Written: Number of records passed to NEXT step
• Input: Number of records read from a file, database, and so on
• Output: Number of records output to a file, database, and so on
• Rejected: Number of records rejected
• Errors: Number of errors
• Active: Status of step (initializing, running, finished, and so on)
Clustering
• Pentaho Cluster is created by using multiple Carte nodes
• Linked cluster running Carte servers
• For availability and failover
• Improves performance

• Carte:
• Simple Web Server that helps execute transformation/jobs remotely.
• Can be executed by executing the following command from the DI folder:
• carte.sh/bat <hostname> <port>
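For example (host names and ports are placeholders), starting two Carte nodes that can then be registered as slave servers in a cluster schema:

  # Run on each node that should join the cluster
  ./carte.sh etl-node1 8081
  ./carte.sh etl-node2 8082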
Clustering continued…
• How to configure a cluster:
• Cluster nodes need to be defined
• Define Cluster schemas based on the existing available nodes

• Running a transformation on a cluster:
• Not all steps in the transformation are executed on the cluster
• Right-click a step and select the cluster it should run on
Partitioning

• In the example above, the records are retrieved and grouped, calculating the record count for each page name.
• Two CPU cores are under heavy load while the rest are idle. To leverage the idle cores, multiple copies of the Group by step could be started.
• However, when started with multiple copies of the Group by step, the transformation starts giving bad results, because each copy aggregates only the rows it happens to receive.
• Partitioning solves this by sending rows with the same key to the same step copy.
Row Normalizer
Row normaliser normalizes rows of data.
For example, input (denormalized):

  weekdate     Miles   Loaded_miles   Empty_miles
  2001-01-07   1996    1996           0
  2001-01-28   587     539            48

Output (normalized):

  Weekdate     Metric Type    Quantity
  2001-01-07   Miles          1996
  2001-01-07   Loaded Miles   1996
  2001-01-07   Empty Miles    0
  2001-01-28   Miles          587
  2001-01-28   Loaded Miles   539
  2001-01-28   Empty Miles    48

This transforms column names into row descriptor values.

It is possible to normalize more than one field at a time, where groups of columns generate unique rows.
Row Normaliser
Options include:
 Typefield: Name of the type field (Metric Type in the example)
 Fields: List of fields to normalize
  • Fieldname: Names of the fields to normalize (Miles, Loaded_Miles, Empty_Miles in the example)
  • Type: A string used to classify the field (Miles, Loaded Miles, Empty Miles in the example)
  • New field: One or more fields where the new value should be transferred (Quantity in the example)
Row Denormaliser
Denormalizes data by looking up key-value pairs
For example, input (normalized):

  Weekdate     Metric Type    Quantity
  2001-01-07   Miles          1996
  2001-01-07   Loaded Miles   1996
  2001-01-07   Empty Miles    0
  2001-01-28   Miles          587
  2001-01-28   Loaded Miles   539
  2001-01-28   Empty Miles    48
  …            …              …

Output (denormalized):

  weekdate     Miles   Loaded_miles   Empty_miles
  2001-01-07   1996    1996           0
  2001-01-28   587     539            48
  …            …       …              …
Row Denormaliser
Options include:
 Key field: Field that defines the key (Metric Type in the example)
 Group fields: Fields that make up the grouping (Weekdate in the example)
 Target fields: Fields to denormalize - specify the string value for the key field (Quantity → Miles, Loaded_Miles, Empty_Miles in the example)
 Options are provided to convert data types
  • Most designs use strings to store values - helpful if the value is actually a number or date
 If there are key-value pair collisions (the key is not unique for the group specified), specify the aggregation method used to compute the new value
Row Flattener
Flattens sequentially provided rows
Usage notes:
 Rows must be sorted in the proper order
 Use the denormaliser if key-value pair intelligence is required for flattening

Example input:

  Field1   Field2   Field3   Flatten
  A        B        C        One
  A        B        C        Two
  D        E        F        Three
  D        E        F        Four

Example output:

  Field1   Field2   Field3   Target1   Target2
  A        B        C        One       Two
  D        E        F        Three     Four
Row Flattener
The options provided for this step include:
 The field to flatten: Field flattened into different target fields (Flatten in the example)
 Target fields: The names of the target fields flattened to (Target1, Target2 in the example)
ETL Patterns

• There are various patterns that can be identified in the ETL flow:
• Historical load
• CDC - Change Data Capture
• Upserts
Thank You
