White Paper

PENTAHO DATA INTEGRATION WITH
GREENPLUM LOADER
The interoperability between Pentaho Data Integration and
Greenplum Database with Greenplum Loader

Abstract
This white paper explains how Pentaho Data Integration (Kettle)
can be configured and used with the Greenplum Database through
Greenplum Loader (GPLOAD), boosting the connectivity and
interoperability of Pentaho Data Integration with the Greenplum
Database.
February 2012

Copyright © 2012 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as
of its publication date. The information is subject to change
without notice.
The information in this publication is provided “as is”. EMC
Corporation makes no representations or warranties of any kind
with respect to the information in this publication, and
specifically disclaims implied warranties of merchantability or
fitness for a particular purpose.
Use, copying, and distribution of any EMC software described in
this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see EMC
Corporation Trademarks on EMC.com.
VMware is a registered trademark of VMware, Inc. All other
trademarks used herein are the property of their respective
owners.
Part Number h8309


Table of Contents

Executive summary
Audience
Organization of this paper
Overview of Pentaho Data Integration
Overview of Greenplum Database
Integration of Pentaho PDI and Greenplum Database
Using JDBC drivers for Greenplum database connections
Installation of new driver
Greenplum Loader: Greenplum's Scatter/Gather Streaming Technology
Parallel Loading
External Tables
Greenplum Parallel File Distribution Server (gpfdist)
How does gpfdist work?
Using gpload to invoke gpfdist
Usage: How to use Greenplum Loader in Pentaho Data Integration
Setup
1) Single ETL Server, Multiple NICs
2) Multiple ETL Servers
Future expansion and interoperability
Conclusion
References

Executive summary

Greenplum Database is a popular analytical database that works with different open source data integration products such as Pentaho Data Integration (PDI), a.k.a. Kettle. Pentaho Kettle is part of the Pentaho Business Intelligence suite, which is used to retrieve, transform and present data to users. Greenplum Database is capable of managing, storing and analyzing large amounts of data, and it can be used on both the source and target sides of Pentaho ETL transformations.

Currently, Pentaho Data Integration connects to Greenplum through JDBC (Java Database Connectivity) drivers. One of the latest enhancements Pentaho made for expanded OLAP support is a native bulk loader integration with EMC Greenplum to improve data loading performance: Pentaho offers native adaptor support for the Greenplum GPLoad capability (bulk loader), which enables joint customers to leverage data integration capabilities to quickly capture, transform and load massive amounts of data into Greenplum Databases.

This white paper documents the Pentaho connectivity and operation capabilities with Greenplum Loader, and shows readers how Pentaho PDI can be used in conjunction with the Greenplum database.

Audience

This white paper is intended for EMC field-facing employees such as sales, technical consultants, and support, as well as customers who will be using the Pentaho Data Integration tool for their ETL work. This is neither an installation guide nor introductory material on Pentaho. Though the reader is not expected to have extensive Pentaho knowledge, a basic understanding of Pentaho data integration concepts and ETL tools will help the reader understand this document better.

Organization of this paper

This paper covers the following topics:

 Executive summary
 Organization of this paper
 Overview of Pentaho Data Integration (PDI)
 Overview of Greenplum Database
 Integration of Pentaho PDI and Greenplum Database
 Using JDBC drivers for Greenplum database connections
 Greenplum Loader: Greenplum's Scatter/Gather Streaming Technology
 Usage: How to use Greenplum Loader in Pentaho Data Integration
 Future expansion and interoperability
 Conclusion

Overview of Pentaho Data Integration

Pentaho Data Integration (PDI) delivers comprehensive Extraction, Transformation and Loading (ETL) capabilities using a metadata-driven approach. It is commonly used in building data warehouses, designing business intelligence applications, migrating data and integrating data models. It consists of the following components:

 Spoon – main GUI and graphical Jobs/Transformations designer
 Carte – HTTP server for remote execution of Jobs/Transformations
 Pan – command-line execution of Transformations
 Kitchen – command-line execution of Jobs
 Encr – command-line tool for encrypting strings for storage
 Enterprise Edition (EE) Data Integration Server – Data Integration Engine, security integration with LDAP/Active Directory, Monitor/Scheduler, Content Management

Pentaho is capable of loading big data sets, in terms of terabytes or petabytes, into the Greenplum Database, taking full advantage of the parallel architecture and the massively parallel processing environment provided by the Greenplum product family.

Overview of Greenplum Database

Greenplum Database is designed on an MPP (Massively Parallel Processing) shared-nothing architecture, which facilitates business intelligence, data integration and big data analytics. Data is distributed and replicated across multiple nodes in the Greenplum Database. Greenplum's MPP architecture allows for increased scalability compared with traditional databases, and leverages parallelism to deliver orders-of-magnitude improvement in query performance. The shared-nothing architecture is optimal for fast queries and loads because processors are placed as close as possible to the data itself, for faster operations with the maximum degree of parallelism possible.

Highlights of the Greenplum Database:

 Dynamic Query Prioritization - provides continuous real-time balancing of resources across queries.

 Self-Healing Fault Tolerance - provides intelligent fault detection and fast online differential recovery.
 Polymorphic Data Storage - Multi-Storage/SSD support - includes tunable compression and support for both row- and column-oriented storage.
 Health Monitoring and Alerting - provides the integrated Greenplum Command Center for advanced support capabilities.
 Analytics Support - supports analytical functions for advanced in-database analytics.

Integration of Pentaho PDI and Greenplum Database

The following diagram shows the basic interoperability between Pentaho Data Integration and the Greenplum Database:

Using JDBC drivers for Greenplum database connections

Pentaho Kettle ships with many different JDBC drivers, each residing in a Java archive (.jar) file in the libext/JDBC directory. There is a startup script which adds all of these .jar files to the environment. By default, Pentaho PDI is shipped with a PostgreSQL JDBC jar file, which is used for the connection when you define your database connection and choose Native (JDBC) as the access method; the same connection definition is also used when loading through Greenplum Loader (gpload/gpfdist). Java JDK 1.6 is required for the installation.
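Under the hood, a Greenplum connection defined this way is carried over the PostgreSQL wire protocol, so the JDBC URL that PDI assembles from the connection settings follows the postgresql scheme. A minimal sketch of the URL form is shown below; the host, port and database values are illustrative, borrowed from the gpload example later in this paper:

```shell
# Hypothetical connection settings -- substitute your own Greenplum master
# host, port and database name.
GP_HOST="mdw-1"
GP_PORT=5432
GP_DB="ops"

# Greenplum is reached over the PostgreSQL wire protocol, so the JDBC URL
# uses the postgresql scheme:
JDBC_URL="jdbc:postgresql://${GP_HOST}:${GP_PORT}/${GP_DB}"
echo "$JDBC_URL"
```

The same URL shape applies whether the connection is used by Spoon, the DI Server, or the BI Server.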

Installation of new driver

To add a new driver, simply drop/copy the .jar file containing the driver into the appropriate directory:

• For Data Integration Server: <Pentaho_installed_directory>/server/data-integration-server/tomcat/lib/
• For Data Integration client: <Pentaho_installed_directory>/design-tools/data-integration/libext/JDBC/
• For BI Server: <Pentaho_installed_directory>/server/biserver-ee/tomcat/lib/
• For Enterprise Console: <Pentaho_installed_directory>/server/enterprise-console/jdbc/

For example, to update the driver for the Data Integration client, the user would update the jar file in <Pentaho_installed_directory>/design-tools/data-integration/libext/JDBC/. If you installed a new JDBC driver for Greenplum to the BI Server or DI Server, you have to restart all affected servers to load the newly installed database driver. In addition, if you want to establish a Greenplum data source in the Pentaho Enterprise Console, you must install that JDBC driver in both the Enterprise Console and the BI Server to make it effective.

Assume that there is a Greenplum Database (GPDB) installed and ready to use. You can then define the Greenplum database connection in the Database Connection dialog: give the connection a name, choose Greenplum as the Connection Type, choose "Native (JDBC)" in the Access field, and give the Host Name, Database Name, Port Number, User Name and Password in the Settings section.

In brief, special attention may be required to set up the hosts files and configuration files both in the Greenplum database and on the hosts where Pentaho is installed, in order to ensure the machines can communicate. For instance, the user may need to add the hostnames and corresponding IP addresses on both systems (i.e. the Pentaho PDI server and the Greenplum Database). In addition, the user may need to configure pg_hba.conf in the Greenplum database with the IP address of the Pentaho host.
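Returning to the driver installation above, the drop-and-restart procedure can be sketched as a few shell commands. The sketch below runs in a scratch directory so it can execute anywhere; on a real system PENTAHO_HOME would be the actual install root (e.g. /opt/pentaho) and the jar would be a real driver downloaded from the PostgreSQL JDBC site (the jar file name here is illustrative):

```shell
# Demonstration in a scratch directory; adjust PENTAHO_HOME for a real install.
WORK="$(mktemp -d)"
PENTAHO_HOME="$WORK/pentaho"
JDBC_DIR="$PENTAHO_HOME/design-tools/data-integration/libext/JDBC"
mkdir -p "$JDBC_DIR"

# Stand-in for the downloaded driver jar:
touch "$WORK/postgresql-9.1-901.jdbc4.jar"

# Drop/copy the jar into the DI client's JDBC directory:
cp "$WORK/postgresql-9.1-901.jdbc4.jar" "$JDBC_DIR/"
ls "$JDBC_DIR"

# Afterwards, restart Spoon (or the affected DI/BI server) so the newly
# installed driver is picked up.
```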

Greenplum Loader: Greenplum's Scatter/Gather Streaming Technology

Parallel Loading

Greenplum's Scatter/Gather Streaming™ (SGS) technology eliminates the bottlenecks associated with data loading, enabling ETL applications to stream data into the Greenplum database quickly. The technology manages the flow of data into all nodes of the database, and supports both large-batch and continuous near-real-time loading patterns with negligible impact on concurrent database operations. It is intended for loading the huge data sets normally used in large-scale analytics and data warehousing: data is automatically partitioned across nodes and optionally compressed, and performance scales with the number of Greenplum Database nodes. This technology is exposed via a flexible and programmable external table interface (explained below) and a traditional command-line loading interface, typically referred to as gpfdist.

Figure 1

Figure 1 shows how Greenplum utilizes a parallel-everywhere approach to loading: data flows from one or more source systems to every node of the database without any sequential bottlenecks. Greenplum's SGS technology ensures parallelism by scattering data from source systems across hundreds or thousands of parallel streams that simultaneously flow to all nodes of the Greenplum Database. Figure 2 shows how the final gathering and storage of data to disk takes place on all nodes simultaneously.

Figure 2

External Tables

External tables enable users to access data in external sources as if it were in a table in the database. In the Greenplum database there are two types of external data sources: external tables and Web tables. They have different access methods. External tables contain static data that can be scanned multiple times; the data does not change during queries. Web tables provide access to dynamic data sources as if those sources were regular database tables; the data can change during the course of a query, and Web tables cannot be scanned multiple times.

For data loading into a Greenplum database, you can generate flat files from an operational or transactional database using export, dump, COPY, or user-written software, depending on the business requirements. The files are usually delimited files or CSV files. This process can be automated to run periodically.

Greenplum Parallel File Distribution Server (gpfdist)

gpfdist is Greenplum's parallel file distribution server utility software. It is used with read-only external tables for fast, parallel data loading of text, CSV and XML files into a Greenplum database. gpfdist can be considered a networking protocol, much like the HTTP protocol; running gpfdist is similar to running an HTTP server. It exposes, via TCP/IP, a local file directory containing the target files. It can also read tar and gzipped files; in that case, the PATH must contain the location of the tar and gzip utilities. The benefit of using gpfdist is that users can take advantage of maximum parallelism while reading from or writing to external tables, thereby offering the best performance as well as easier administration of external tables.
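To make the external table idea concrete, the following sketch writes out the kind of DDL an ETL process would submit to the Greenplum master. The host name etl1, the file pattern, and the table names are assumptions for illustration (port 8887 matches the gpfdist startup example in the next section); applying the DDL would require psql and a running Greenplum system, which is not done here:

```shell
# Write the external-table DDL to a file. To apply it you would run
#   psql -d <database> -f create_ext_lineitem.sql
# against the Greenplum master (not done in this sketch).
cat > create_ext_lineitem.sql <<'EOF'
CREATE EXTERNAL TABLE ext_lineitem (LIKE lineitem)
LOCATION ('gpfdist://etl1:8887/lineitem*.dat')
FORMAT 'TEXT' (DELIMITER '|');
EOF
cat create_ext_lineitem.sql
```

Once defined, the external table can be queried or used as the source of an INSERT INTO ... SELECT, and every Greenplum segment pulls data from gpfdist in parallel.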

How does gpfdist work?

gpfdist runs in a client-server model. A simple startup of the gpfdist server uses the following command syntax:

gpfdist -d <files_directory> -p <port_number> -l <log_file> &

For example:

# gpfdist -d /etl-data -p 8887 -l gpfdist_8887.log &
[1] 28519
# Serving HTTP on port 8887

In the above example, gpfdist is set up to run on the Greenplum DIA server, anticipating data loading from flat files stored in the directory /etl-data. Port 8887 is opened and listening for data requests, and a log file is created in /home/gpadmin/etl-log. You indicate the directory where source files are dropped/copied; optionally, you may also designate the TCP port number to be used. The Greenplum EXTERNAL TABLE feature allows us to define network data sources as tables that we can query to speed up the data loading process.

Using gpload to invoke gpfdist

Pentaho leverages the parallel bulk loading capabilities of GPDB using the Greenplum data loading utility "gpload". gpload acts as an interface to the Greenplum Database's external table parallel loading feature. Using a load specification defined in a YAML-formatted control file, gpload executes a load by invoking the Greenplum parallel file server (gpfdist), creating an external table definition based on the source data defined, and executing an INSERT, UPDATE or MERGE operation to load the source data into the target table in the database.

The gpload program processes the control file document in order, and uses indentation (spaces) to determine the document hierarchy and the relationships of the sections to one another. The use of white space is significant: white space should not be used simply for formatting purposes, and tabs should not be used at all.

The basic structure of a load control file:

---
VERSION: 1.0.0.1
DATABASE: db_name
USER: db_username
HOST: master_hostname
PORT: master_port
GPLOAD:
  INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - hostname_or_ip
        PORT: http_port
        | PORT_RANGE: [start_port_range, end_port_range]
        FILE:
          - /path/to/input_file
    - COLUMNS:
        - field_name: data_type
    - FORMAT: text | csv
    - DELIMITER: 'delimiter_character'
    - ESCAPE: 'escape_character' | 'OFF'
    - NULL_AS: 'null_string'
    - FORCE_NOT_NULL: true | false
    - QUOTE: 'csv_quote_character'
    - HEADER: true | false
    - ENCODING: database_encoding
    - ERROR_LIMIT: integer
    - ERROR_TABLE: schema.table_name
  OUTPUT:
    - TABLE: schema.table_name
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
        - target_column_name
    - UPDATE_COLUMNS:
        - target_column_name
    - UPDATE_CONDITION: 'boolean_condition'
    - MAPPING:
        target_column_name: source_column_name | 'expression'
  PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
  SQL:
    - BEFORE: "sql_command"
    - AFTER: "sql_command"
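Because gpload derives the document hierarchy from indentation, a control file containing tab characters will fail. A small pre-flight check can catch this before a load is attempted; the snippet below creates a minimal control-file fragment just so it can run standalone (the file name my_load.yml anticipates the example that follows):

```shell
# Create a minimal control-file fragment to check (indented with spaces only):
printf 'GPLOAD:\n  INPUT:\n    - FORMAT: text\n' > my_load.yml

# Fail fast if any line contains a tab character:
if grep -q "$(printf '\t')" my_load.yml; then
  echo "ERROR: tabs found in my_load.yml"
else
  echo "OK: no tabs in my_load.yml"
fi
```

The same one-line grep can be dropped into an ETL wrapper script ahead of every gpload invocation.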

my_load.etl1-4 PORT: 8081 FILE: .yml: --VERSION: 1. Check the environment variables for PATH.category: text .desc: text .amount: float4 . GPHOME_LOADERS and PYTHONPATH are correctly installed. For example.yml It is recommended that we confirm that gpload is running successfully. you can run gpload at the system (command) prompt to verify.0.SOURCE: LOCAL_HOSTNAME: ./var/load/data/* . users can run a load job as defined in my_load. If gpload.py script is not successfully executed. This file is divided into sections for easy reference.yml using gpload: gpload -f my_load.name: text .FORMAT: text PENTAHO DATA INTEGRATION WITH GREENPLUM LOADER 14 . to reduce the chance of future errors.etl1-2 .AFTER: "sql_command" Above example shows syntax for GPLOAD using YAML file. please confirm the following settings: Check if the correct version is installed by checking the gpload readme. By copying a small representation of a source file and a control (YAML) file.1 DATABASE: ops USER: gpadmin HOST: mdw-1 PORT: 5432 GPLOAD: INPUT: .COLUMNS: .date: date .. you can run gpload. As a first step. Check if the pathname environmental variables are pointing or including to the correct path Example of the load control file .etl1-3 .etl1-1 .py using a sample load control file. those horizontal lines are not to be placed in a YAML file.0.

DELIMITER: '|' ..TABLE: payables.ERROR_TABLE: payables. As you can see in the above example.BEFORE: "INSERT INTO audit VALUES('start'. By using Pentaho.expenses . The GPLoad data loading utility is used for massively parallel data loading using Greenplum's external table parallel loading feature.MODE: INSERT SQL: . The following diagrams show the typical deployment scenarios for performing parallel loading to Greenplum Database: PENTAHO DATA INTEGRATION WITH GREENPLUM LOADER 15 . current_timestamp)" . there are some pre-built steps inside the “Bulk loading” folder in the Design windows of Spoon.ERROR_LIMIT: 25 . you do not need to write your own YAML file.err_expenses OUTPUT: . field names and most of the content need to be in a certain format. The customized Greenplum step is called “Greenplum Load”. four ETL servers are used for feeding data into Greenplum through GPLOAD. The “Greenplum Load” step wraps the Greenplum GPLoad data loading utility we just discussed. current_timestamp)" Note: YAML file is not a free formatted file. GPLoad can be implemented in either single or multiple Pentaho ETL servers.AFTER: "INSERT INTO audit VALUES('end'. which will help to generate the YAML file when all the necessary details are provided.

1) Single ETL Server, Multiple NICs

2) Multiple ETL Servers

Usage: How to use Greenplum Loader in Pentaho Data Integration

Setup

Here are the steps to set up a simple transformation to test the Greenplum Loader:

1) Create the Text File Input step by defining a source file (e.g. a CSV or delimited file). Choose the 'Text File Input' component under the Design tab, inside the Input folder. Double-click on the Text File Input step and choose the correct input delimited file.

2) Click on the next tab, Content, to define how to parse the CSV file.

3) Go to the next tab, Fields, and click on Get Fields to define all the fields.

A sample source file lineitem.csv/lineitem.dat should look like this:

1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|lineitem 1 comments
2|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|lineitem 2 comments
…….
100|61336|8855|1|31|40217.23|0.09|0.04|A|F|1993-10-29|1993-12-19|1993-11-08|COLLECT COD|TRUCK|lineitem 100 comments
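Before pointing gpload or the Greenplum Load step at a file like this, it is worth confirming that every row carries the expected number of delimited fields, since malformed rows count against the load's error limit. A quick sketch using the first two sample rows above (16 pipe-delimited fields each):

```shell
# Recreate the first two sample rows shown above:
cat > lineitem.dat <<'EOF'
1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|lineitem 1 comments
2|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|lineitem 2 comments
EOF

# Count pipe-delimited fields per row; a clean file yields a single value (16):
awk -F'|' '{ print NF }' lineitem.dat | sort -u
```

Any extra values in the output point at rows with missing or embedded delimiters that would otherwise surface only in the error table.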

4) You should create a target table called "lineitem" which contains:

CREATE TABLE lineitem
(
  l_orderkey integer,
  l_partkey integer,
  l_suppkey integer,
  l_linenumber integer,
  l_quantity numeric(15,2),
  l_extendedprice numeric(15,2),
  l_discount numeric(15,2),
  l_tax numeric(15,2),
  l_returnflag character(1),
  l_linestatus character(1),
  l_shipdate date,
  l_commitdate date,
  l_receiptdate date,
  l_shipinstruct character(25),
  l_shipmode character(10),
  l_comment character varying(44)
)
WITH (
  OIDS=FALSE
)
DISTRIBUTED BY (l_orderkey);

ALTER TABLE lineitem OWNER TO gpadmin;

Next, you will need to create the Greenplum Load step:

The details of the Greenplum Load step need to be defined as follows. First, you have to choose the correct connection and target table. Then, click on the Get fields button to generate all the target table fields. After that, click on the Edit Mapping button to define all the mappings from the sources to the targets:

Next. please click OK to save. data file location: Once you complete the definitions. PENTAHO DATA INTEGRATION WITH GREENPLUM LOADER 21 . A sample job can be created through adding the Hop between the Text Input and Greenplum Load steps. control file. go to the GP Configuration tab in order to define the correct GPLOAD.

When everything is defined and saved, you can execute the transformation/job by clicking the green arrow at the top left corner. Once the execution is finished, you can check the Logging and Step Metrics sections to see whether the transformation executed successfully. You can also verify that the data was loaded into the target Greenplum database table, lineitem, through gpload. The above transformation is just a sample; a user can add different components to this transformation, or incorporate it into a well-developed job for transforming the data.

Future expansion and interoperability

Both Greenplum and Pentaho are rapidly innovating and extending their capabilities to satisfy the requirements of the big data industry. To meet the challenges of fast data loading, the EMC Data Integration Accelerator (DIA) is purpose-built for batch and micro-batch loading, and leverages a growing number of data integration applications such as Pentaho. Both companies are therefore working together to expand their interoperability to meet these constantly growing demands.

Conclusion

This white paper discussed how to use the Greenplum Loader step (GPLOAD) to enhance the loading capability and performance of Pentaho Data Integration. It covered the preliminary interoperability between Pentaho PDI and the Greenplum database for data integration and business intelligence projects, using Greenplum's Scatter/Gather Streaming technology embedded in Greenplum Loader.

References

1) Pentaho Kettle Solutions – Building Open Source ETL Solutions with Pentaho Data Integration (ISBN-10: 0470635177 / ISBN-13: 978-0470635179)
2) Getting Started with Pentaho Data Integration guide from www.pentaho.com
3) Greenplum Database 4.1 Load Tools for UNIX guide
4) Greenplum Database 4.1 Load Tools for Windows guide
5) Pentaho Community – Greenplum Load
