
Talend Data Integration

Subramanyam K
Agenda

•Introduction to Talend / Talend Overview
•Introduction to ETL
•Introduction to Talend Open Studio
•Talend Data Integration Architecture
•Talend Studio
•Metadata Creation (Files/Database)
•File Components
•Data Processing Components
•Context Variables
•Incremental Loads
•SCD (Slowly Changing Dimensions)
•Windows/Unix Commands Component in Talend
•Nested Jobs
•Error Handling/Error Logging
•Deployment of Jobs
•Job Version Controlling in Talend
•Dynamic Schema
•Performance Tuning Tips
Talend Overview

 Talend was created in late 2005 by Bertrand Diard and Fabrice Bonan


 Founded in 2006
 Open Core business model
• Subscription license
• Services & training
• Optional support
 Speed, Scalability, Simplicity, best Big Data capabilities on the Market, Unified Platform,
No duplication
 Distributed architecture
 900+ Components & Connectors
 Talend Unified Platform: Data Integration, Data Quality, Master Data Management,
Enterprise Service Bus, Business Process Management
Talend Overview
 Open Source/Enterprise Edition.
 Java Code Generator.
 Based on the Eclipse framework
 Talend offers several products, including
• Talend for Data Integration
• Talend for Big data
• Talend for BPM
• Talend for Data Management
• Talend for ESB

 Integration with Hadoop (Big Data): Hive, Pig, Sqoop, MapReduce.


Introduction to Talend

 Talend is a company focused on Data Integration and Data Management solutions. Talend was named a
Gartner “Cool Vendor” in 2010. It is present in more than 12 locations around the world and is a
fast-growing company.

 Talend Open Studio for Data Integration is an open source data integration product developed by Talend.

 Data Integration involves combining data residing in different sources and providing the user with a
unified view of the data.
Introduction to ETL
Extraction Transformation Loading (ETL)

● ETL is a common process in Data Integration


● Extract: read data from different data sources (databases, flat files, spreadsheet files, web services, etc.)
● Transform: convert the data into a form suitable for the target container (database, web services, files, etc.).
Cleaning, computations and verifications are also performed
● Load: write the data in the target format
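The three steps can be sketched in plain Java as a tiny in-memory pipeline. This is illustrative only, not Talend-generated code; the source records, field names and "target" map are hypothetical:

```java
import java.util.*;
import java.util.stream.*;

public class MiniEtl {
    // Extract: read raw delimited records (here from an in-memory "source")
    static List<String> extract() {
        return Arrays.asList("1;alice", "2;bob");
    }

    // Transform: split each record into fields and clean the name
    static List<String[]> transform(List<String> rows) {
        return rows.stream()
                .map(r -> r.split(";"))
                .map(f -> new String[]{f[0], f[1].toUpperCase()})
                .collect(Collectors.toList());
    }

    // Load: write the transformed rows into the "target" (a map keyed by id)
    static Map<String, String> load(List<String[]> rows) {
        Map<String, String> target = new LinkedHashMap<>();
        for (String[] f : rows) target.put(f[0], f[1]);
        return target;
    }

    public static void main(String[] args) {
        System.out.println(load(transform(extract()))); // {1=ALICE, 2=BOB}
    }
}
```

In a real Job the extract and load stages would be file or database components; only the shape of the three-step flow carries over.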
ETL
Data Integration Definition

Data integration refers to the technical and business processes used to combine data from multiple
sources to provide a unified, single view of the data.
Talend Offerings
Talend Offerings / Talend Unified Platform
Talend Unified Platform
Talend Architecture
Talend Data Integration - Architecture
Functional Architecture
How does it work?

■ Talend is a code-generating ETL tool: it generates Java code to perform the data Extraction,
Transformation and Loading.
■ For database connectivity, Talend uses ODBC and JDBC drivers delivered and certified by the
database vendors themselves (Oracle, IBM DB2, Teradata, SQL Server, MySQL, etc.).
■ It implements the bulk loaders of those database vendors through their APIs or executables.
■ For file connectivity, Talend leverages different libraries depending on the format: delimited files
are straightforward, while XML or JSON use the appropriate library (XPath queries, etc.).
■ For CRM, ERP and other business applications, Talend integrates through the Web Services APIs
provided by the business application vendors.
■ For SAP, Talend reuses the JCo connector/library, certified and provided by SAP itself, to reach
RFC and BAPI functions; Talend also provides an IDoc connector for SAP.
■ Talend also provides connector and protocol support for FTP (FTPS or SFTP), SCP, SOAP,
REST, web services, RSS, LDAP, etc.
TALEND STUDIO
Talend System Requirement
The following are the system requirements to download and work on Talend Open Studio −

Recommended Operating system


 Microsoft Windows 10
 Ubuntu 16.04 LTS
 Apple macOS 10.13/High Sierra
Memory Requirement
 Memory - Minimum 4 GB, Recommended 8 GB
 Storage Space - 30 GB
In addition, you need an up-and-running Hadoop cluster (preferably Cloudera).
Note − Java 8 must be available with environment variables already set.
Talend Installation
Talend Studio
Talend Data Integration Studio [cont…]

1. Repository: All the metadata is accessible in this section, including
Jobs, Data Object Definitions, Connection Objects, Routines, etc.

2. Workspace: Job design and development is done here.

3. Properties: Configuration of Components, Jobs and Context Variables
can be done here.

4. Palette: All components and connectors can be browsed and selected
from here.
Talend Data Integration Studio
Difference between Talend Open Studio and Enterprise Edition
Metadata
Initial Setup – Create Connection/MetaData
• Create Connection
• Allows connection setup for the following, amongst others:

 DB connection (relational)
 JDBC schema
 SAS connection
 File schema
 FTP connection
 LDAP schema
 Salesforce schema
 Generic schema
 MDM connection
 WSDL schema
Initial Setup – Create File MetaData

Create Flat File Delimited Connection


Step-1: Expand Metadata in the Repository tree view, right-click “File Delimited”,
and select “Create File Delimited”.
Step-2: In the connection wizard, fill in the generic connection information.
Step-3: In the next screen, select the OS format (Windows/UNIX) and browse to the file.
Step-4: Set all file-related details, such as the field separator.
Step-5: Click Next.
Step-6: Click Finish.
Metadata Creation
Metadata Continue..
Metadata Creation Completed.
Initial Setup – Create Connection - Relational

Create Relational (DB) Connection


Step-1: Expand Metadata in the Repository tree view, right-click Db Connections,
and select Create connection.
Step-2: In the connection wizard, fill in the generic connection information.
Step-3: In the next screen, select the type of DB connection, provide all login details, and test the connection.
Step-4: Once the connection is created, select it under the Metadata folder,
right-click “Retrieve Schema”, and in the next wizard select the relevant data object types to pull in.
Step-5: Select the tables required to import.
Step-6: Click Next.
Step-7: Click Finish.
Importing a Database Object
1. Create a Database connection
2. Retrieve schemas from Database using the connection
created above
Importing a Database Object [cont’d…]
tFileInputDelimited
 tFileInputDelimited reads a given file row by row, splits each row into simple separated
fields, and sends the fields, as defined in the schema, to the next Job component via a row link.

tFileInputDelimited properties
• File Name/Stream - Name and path of the file to be processed.

• Row separator - String (e.g. "\n" on Unix) used to distinguish rows.

• Field separator - Character, string or regular expression to separate fields.


tLogRow

• tLogRow displays data or results in the Run console. It is used to
monitor the data processed.

o tLogRow properties
• Basic - Displays the output flow in basic mode.
• Table - Displays the output flow in table cells.
• Separator - Enter the separator which will delimit data on the Log display.
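As a rough sketch of what the tFileInputDelimited → tLogRow pair does, the plain-Java snippet below splits raw content on the configured row and field separators and prints each row to the console. The separators and sample data are hypothetical, and this is not the code Talend generates:

```java
public class DelimitedLog {
    // Split raw content into rows and fields, the way tFileInputDelimited
    // applies its Row separator and Field separator settings.
    static String[][] parse(String content, String rowSep, String fieldSep) {
        String[] rows = content.split(rowSep);
        String[][] out = new String[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            out[i] = rows[i].split(fieldSep);
        }
        return out;
    }

    public static void main(String[] args) {
        // Print each parsed row to the console, tLogRow-style
        for (String[] row : parse("1,Ana\n2,Raj", "\n", ",")) {
            System.out.println(String.join("|", row));
        }
    }
}
```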
Creating a first Job

Create Job from Job design


Step-1: Right-click on Job Designs under ‘Repository’ on the left.
Step-2: Click “Create job” to open the new job wizard, provide the job details (job name, description, purpose), and click Finish.
Step-3: Click components in the ‘Palette’ on the right and drag and drop them onto the job designer (for example, input and output components for files or databases).
Step-4: Select the tables/files required to import.
Step-5: Provide the details for each component in the ‘Component’ view of the configuration tab.
Step-6: Provide or import context values, if required, in the ‘Context’ view of the configuration tab (optional).
Step-7: In the ‘Run’ view of the configuration tab, click the Run button to run the job.
Creating a first Job
Let’s create a Job that reads data from a Delimited file and writes
into a Database table
Creating a first Job [cont’d…]
Talend Job Creation

tMap transforms and routes data from single or multiple sources to single or
multiple destinations.
Talend Job Creation completed.
Create First Job
Step-1: On the Job Designs node, right-click and select “Create job”.
Step-2: Drag a connection (as applicable) from the Metadata palette, and choose Input/Output as applicable.
Step-3: Right-click on the source, choose Main, and drag to the target.
Step-4: Update the source and target locations as applicable.
Step-5: Click Run.
COMPONENTS

COMPONENTS
PALETTE VIEW
What is a component?

A component is a preconfigured connector used to
perform a specific data integration operation. It minimizes the
amount of hand coding required to work on data from multiple,
heterogeneous sources.

• Basically, a component is a functional piece that performs
a single operation.
• Graphically, a component is an icon that you can drag and drop
from the Palette to the workspace.

There are around 900+ components in Talend, subdivided
into different subfamilies.
Components of Talend DI

• Big Data components
• Business components
• Business Intelligence components
• Cloud components
• Custom Code components
• Data Quality components
• Databases - traditional components
• Databases - appliance/data warehouse components
• Databases - other components
• DotNET components
• ELT components
• ESB components
• File components
• Internet components
• Logs & Errors components
• Misc. group components
• Orchestration components
• Processing components
• System components
• Talend MDM components
• Technical components
• XML components
Database Component
 These include connectors for the most popular and traditional databases.

 These connectors cover various needs, including:


 opening connections,

 reading and writing tables,

 committing transactions as a whole,

 performing rollback for error handling.

 Over 40 RDBMS are supported, along with other types of database connectors, such as connectors
for appliance/data warehouse databases and database management.

 There are about 346 components under the Database component category.
Database Component (Contd..)
Databases - Traditional components
(Partial list, grouped by database:)

• Access: tAccessBulkExec, tAccessClose, tAccessCommit, tAccessConnection, tAccessInput, tAccessOutput, tAccessOutputBulk, tAccessOutputBulkExec, tAccessRollback, tAccessRow
• AS/400: tAS400Close, tAS400Commit, tAS400Connection
• DB2: tDB2Commit, tDB2Connection, tDB2Input, tDB2Output, tDB2Rollback, tDB2Row, tDB2SCD, tDB2SCDELT, tDB2SP
• Informix: tInformixBulkExec, tInformixClose, tInformixCommit, tInformixConnection, tInformixSP
• MemSQL: tMemSQLClose, tMemSQLConnection, tMemSQLInput, tMemSQLOutput, tMemSQLRow
• MS SQL: tMSSqlBulkExec, tMSSqlClose, tMSSqlColumnList, tMSSqlCommit, tMSSqlConnection, tMSSqlInput, tMSSqlLastInsertId, tMSSqlTableList
• MySQL: tMysqlBulkExec, tMysqlClose, tMysqlColumnList, tMysqlCommit, tMysqlConnection, tMysqlInput, tMysqlLastInsertId, tMysqlLookupInput, tMysqlOutput, tMysqlOutputBulk, tMysqlOutputBulkExec, tMysqlRollback
• OLE DB: tOleDbRow
• Oracle: tOracleBulkExec, tOracleClose, tOracleCommit, tOracleConnection, tOracleInput, tOracleOutput, tOracleOutputBulk, tOracleOutputBulkExec, tOracleRollback, tOracleRow, tOracleSCD, tOracleSCDELT
• PostgreSQL: tPostgresqlOutput, tPostgresqlOutputBulk, tPostgresqlOutputBulkExec, tPostgresqlRollback, tPostgresqlRow, tPostgresqlSCD, tPostgresqlSCDELT
• Sybase: tSybaseBulkExec, tSybaseClose, tSybaseCommit, tSybaseConnection, tSybaseInput, tSybaseIQBulkExec
Database Component (Contd..)

Databases - Appliance/Datawarehouse components

(Partial list, grouped by database:)

• Greenplum: tGreenplumBulkExec, tGreenplumClose, tGreenplumCommit, tGreenplumConnection, tGreenplumGPLoad, tGreenplumInput, tGreenplumOutput, tGreenplumOutputBulk, tGreenplumOutputBulkExec, tGreenplumRollback, tGreenplumRow, tGreenplumSCD
• Ingres: tIngresBulkExec, tIngresClose, tIngresCommit, tIngresConnection, tIngresInput, tIngresOutput, tIngresOutputBulk, tIngresOutputBulkExec, tIngresRollback, tIngresRow, tIngresSCD
• Netezza: tNetezzaBulkExec, tNetezzaClose, tNetezzaCommit, tNetezzaConnection, tNetezzaInput, tNetezzaNzLoad, tNetezzaOutput, tNetezzaRollback, tNetezzaRow, tNetezzaSCD
• ParAccel: tParAccelBulkExec, tParAccelClose, tParAccelCommit, tParAccelConnection, tParAccelInput, tParAccelOutput, tParAccelOutputBulk, tParAccelOutputBulkExec, tParAccelRollback, tParAccelRow, tParAccelSCD
• Redshift: tRedshiftBulkExec, tRedshiftClose, tRedshiftCommit, tRedshiftConnection, tRedshiftInput, tRedshiftOutput, tRedshiftOutputBulk, tRedshiftOutputBulkExec, tRedshiftRollback, tRedshiftRow, tRedshiftUnload
• Teradata: tTeradataClose, tTeradataCommit, tTeradataConnection, tTeradataFastExport, tTeradataFastLoad, tTeradataFastLoadUtility, tTeradataInput, tTeradataMultiLoad, tTeradataOutput, tTeradataRollback, tTeradataRow, tTeradataSCD, tTeradataSCDELT, tTeradataTPTExec, tTeradataTPTUtility, tTeradataTPump
• VectorWise: tVectorWiseCommit
Database Component (Contd..)

Databases - Other components


(Partial list, grouped by database:)

• Cassandra: tCassandraBulkExec, tCassandraClose, tCassandraConnection, tCassandraInput, tCassandraOutput, tCassandraOutputBulk, tCassandraOutputBulkExec, tCassandraRow
• Couchbase: tCouchbaseClose, tCouchbaseConnection, tCouchbaseInput, tCouchbaseOutput
• CouchDB: tCouchDBClose, tCouchDBConnection, tCouchDBInput, tCouchDBOutput
• Generic: tCreateTable, tDBInput, tDBOutput, tDBSQLRow
• EXASOL: tEXABulkExec, tEXAClose, tEXACommit, tEXAConnection, tEXAInput, tEXAOutput, tEXARollback, tEXARow
• eXist: tEXistConnection, tEXistDelete, tEXistGet, tEXistList, tEXistPut, tEXistXQuery, tEXistXUpdate
• Firebird: tFirebirdClose, tFirebirdCommit, tFirebirdConnection, tFirebirdInput, tFirebirdOutput, tFirebirdRollback, tFirebirdRow
• HBase: tHBaseClose, tHBaseConnection, tHBaseInput, tHBaseOutput
• Hive: tHiveClose, tHiveConnection, tHiveCreateTable, tHiveInput, tHiveLoad, tHiveRow
• HSQLDB: tHSQLDbInput, tHSQLDbOutput, tHSQLDbRow
• Interbase: tInterbaseClose, tInterbaseCommit, tInterbaseConnection, tInterbaseInput, tInterbaseOutput, tInterbaseRollback, tInterbaseRow
• JavaDB: tJavaDBInput, tJavaDBOutput, tJavaDBRow
• JDBC: tJDBCClose, tJDBCColumnList, tJDBCCommit, tJDBCConnection, tJDBCInput, tJDBCOutput, tJDBCRollback
Different Types of DB available in Talend
tOracleComponents

 tOracleConnection opens a connection to the database for the current transaction. This
component is to be used along with tOracleCommit and tOracleRollback.

 tOracleCommit commits a global transaction in one go, instead of committing on every row or
every batch, and thus provides a performance gain.

 tOracleRollback cancels the transaction committed in the connected DB.

 tOracleInput reads a database and extracts fields based on a query.

 tOracleRow is the specific component for database queries. It executes the SQL
statement stated on the specified database; the statement is typically an
insert, update or delete.

 tOracleOutput writes, updates, makes changes or suppresses entries in a
database.

 tOracleSP calls a database stored procedure.

 tOracleBulkExec executes the Insert action on the data provided;
it allows a performance gain.
tOracleOutput
Example job to demonstrate tOracleComponents
ELT
 The ELT family groups together the most popular database connectors and processing
components.

 This mode supports all of the most popular databases including Teradata, Oracle, Vertica,
Netezza, Sybase, etc

 In ELT data is migrated in bulk according to the data set and the transformation process
occurs after the data has been loaded into the targeted DBMS in its raw format.

 Less stress is placed on the network and larger throughput is gained.

 Note, however, that as SQL is less powerful than Java, the scope of available data transformations
is limited, and ELT requires users with high proficiency in SQL tuning and DBMS tuning.
ELT (Contd..)
Types of ELT Components

• ELT connections: tAccessConnection, tAS400Connection, tDB2Connection, tEXAConnection, tFirebirdConnection, tGreenplumConnection, tHiveConnection, tIngresConnection, tInterbaseConnection, tJDBCConnection, tMSSqlConnection, tMysqlConnection, tNetezzaConnection, tOracleConnection, tParAccelConnection, tPostgresPlusConnection, tPostgresqlConnection, tSAPHanaConnection, tSQLiteConnection, tSybaseConnection, tTeradataConnection, tVectorWiseConnection
• ELT Input/Map/Output (per database): tELTGreenplumInput/Map/Output, tELTHiveInput/Map/Output, tELTJDBCInput/Map/Output, tELTMSSqlInput/Map/Output, tELTMysqlInput/Map/Output, tELTNetezzaInput/Map/Output, tELTOracleInput/Map/Output, tELTPostgresqlInput/Map/Output, tELTSybaseInput/Map/Output, tELTTeradataInput/Map/Output, tELTVerticaInput/Map/Output
• Combined SQL: tCombinedSQLAggregate, tCombinedSQLFilter, tCombinedSQLInput, tCombinedSQLOutput
• SQL templates: tSQLTemplate, tSQLTemplateAggregate, tSQLTemplateCommit, tSQLTemplateFilterColumns, tSQLTemplateFilterRows, tSQLTemplateMerge, tSQLTemplateRollback
File Components

INPUT: tFileInputDelimited, tFileInputExcel, tFileInputXML, tFileInputProperties, tFileInputRegex

OUTPUT: tFileOutputDelimited, tFileOutputExcel, tFileOutputXML, tFileOutputProperties
File Input Components
 tFileInputDelimited: Reads a given file row by row with simple separated fields.

 tFileInputExcel: Reads an Excel file (.xls or .xlsx) and extracts data line by line

 tFileInputXML: Reads an XML structured file and extracts data row by row.

 tFileInputMSDelimited: Reads and outputs multiple schemas within a delimited file.

 tFileInputMSXML: Reads and outputs multiple schemas within an XML structured file.
tFileInputDelimited
 tFileInputDelimited reads a given file row by row, splits each row into simple separated
fields, and sends the fields, as defined in the schema, to the next Job component via a row link.

tFileInputDelimited properties
• File Name/Stream - Name and path of the file to be processed.

• Row separator - String (e.g. "\n" on Unix) used to distinguish rows.

• Field separator - Character, string or regular expression to separate fields.


File Output Components
 tFileOutputDelimited: Writes to a delimited file row by row, with simple fields separated by
comma, tab, etc.

 tFileOutputMSDelimited: Writes a file row by row based on the multiple schema and
pattern in a delimited file.

 tFileOutputExcel: Writes cells in an MS Excel file row by row, with data values separated
according to a defined schema.

 tFileOutputXML: Writes an XML file with separated data values according to an XML tree
structure. XML structure is created from rows broken into fields.
tFileOutputDelimited

• tFileOutputDelimited writes a delimited file that holds data organized according to the defined schema. It
outputs data to a delimited file.

• tFileOutputDelimited properties
• File name - File name with path
• Row Separator - “\n”
• Field Separator - “,”
• Append - check box.
• Include Header - check box.
• Compress as zip file - check box.
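A minimal sketch of the serialization tFileOutputDelimited performs, written as plain Java over in-memory rows. The separators, header and row values here are hypothetical, and a real Job writes to the configured file path rather than returning a string:

```java
import java.util.List;

public class DelimitedWriter {
    // Serialize rows using the configured field and row separators,
    // optionally prepending a header row (the "Include Header" option).
    static String write(List<String[]> rows, String fieldSep, String rowSep,
                        String[] header, boolean includeHeader) {
        StringBuilder sb = new StringBuilder();
        if (includeHeader) {
            sb.append(String.join(fieldSep, header)).append(rowSep);
        }
        for (String[] r : rows) {
            sb.append(String.join(fieldSep, r)).append(rowSep);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(write(List.of(new String[]{"1", "Ana"}),
                ",", "\n", new String[]{"id", "name"}, true));
    }
}
```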
File management Components
There are various components that help implement operations (delete, compare, copy, archive,
etc.) on day-to-day files.
 tFileArchive: This component creates a new zip, gzip, or tar.gz archive file from one or more
specified files or folders using different compression methods.

 tFileList: Retrieves a set of files or folders based on a filemask pattern and iterates on each set of
files or folders.

 tFileCompare: Compares two files and provides comparison data. Helps control the data
quality of the files being processed.

 tFileCopy: This component copies a source file or folder into a target directory.

 tFileDelete: Deletes the files from the defined directories.

 tFileExist: Checks if a file exists or not.


File management Components continue..

tFileDelete

tFileCopy

tFileList

tFileRowCount

tFileUnarchive

tFileExist tFileCompare
Introduction to various key components
 tMap: Allows joins, filtering, transformations and routes data from single or multiple sources to single or multiple targets.
Function: tMap is an advanced component, which integrates itself as plugin to Talend Studio.
Purpose: tMap transforms and routes data from single or multiple sources to single or multiple destinations.
Usage: Possible uses are from a simple reorganization of fields to the most complex Jobs of data multiplexing or demultiplexing
transformation, concatenation, inversion, filtering and more...

 tJoin: Performs inner or left outer join between the main data flow and a lookup flow.
Function: tJoin joins two tables by doing an exact match on several columns. It compares columns from the
main flow with reference columns from the lookup flow and outputs the main flow data and/or the rejected data.
Purpose: This component helps you ensure the data quality of any source data against a reference data source.
Usage: This component is not startable and requires two input components and one or more output components.
 tSortRow: tSortRow sorts input data based on one or several columns, by sort type and order. It allows multi-column sorting
on input data, providing advanced sorting capabilities (asc/desc sorting, alphabetical sorting). The tSortRow component belongs to the
Processing family.
 tFilterRow: Filters input rows by setting conditions on the selected columns.
Function : tFilterRow filters input rows by setting one or more conditions on the selected columns.
Purpose: tFilterRow helps parametrizing filters on the source data.
Usage: This component is not startable (green background) and it requires an output component.
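The exact-match lookup that tJoin performs can be sketched in plain Java: main-flow rows are matched against a lookup flow on a key, matched rows go to the output, and unmatched rows would go to the reject flow. The data and column layout below are hypothetical, not Talend-generated code:

```java
import java.util.*;

public class InnerJoin {
    // Inner join: keep only main rows whose key has a match in the lookup flow.
    static List<String> join(Map<String, String> mainFlow, Map<String, String> lookup) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : mainFlow.entrySet()) {
            String ref = lookup.get(e.getKey());
            if (ref != null) {
                out.add(e.getKey() + "," + e.getValue() + "," + ref);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> mainFlow = new LinkedHashMap<>();
        mainFlow.put("1", "Ana");
        mainFlow.put("2", "Raj");
        // Row "2" has no lookup match, so it is dropped (or rejected)
        System.out.println(join(mainFlow, Map.of("1", "Paris")));
    }
}
```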
Introduction to various key components
 tConvertType: Allows specific conversions at run time from one Talend java type to another
type.

 tAggregateRow: Receives a flow and aggregates it based on one or more columns. For each
output line, the aggregation key and the relevant results of the set operations (min, max,
sum, avg, count, etc.) are provided.

 tUniqRow: Makes a data flow unique based on the key set on the schema.

 tUnite: Merges multiple inputs to single output.
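tUniqRow's behaviour, keeping the first occurrence per key, can be sketched as follows (a plain-Java illustration with hypothetical two-column rows, the key in column 0):

```java
import java.util.*;

public class UniqRows {
    // Keep the first occurrence of each key value, as tUniqRow does when
    // a schema column is marked as the unique key.
    static List<String[]> uniq(List<String[]> rows, int keyCol) {
        Set<String> seen = new HashSet<>();
        List<String[]> out = new ArrayList<>();
        for (String[] r : rows) {
            if (seen.add(r[keyCol])) {
                out.add(r); // later duplicates of the key are filtered out
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[]{"1", "Ana"},
                new String[]{"1", "Ana duplicate"},
                new String[]{"2", "Raj"});
        for (String[] r : uniq(rows, 0)) {
            System.out.println(String.join(",", r));
        }
    }
}
```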


Introduction to various key components
 tReplicate: Creates multiple output sets from a single input set.
Function: Duplicate the incoming schema into two identical output flows.
Purpose: Allows you to perform different operations on the same schema.
Usage: This component is not startable (green background), it requires an Input component and an output component.

 tRowGenerator: Generates sample data.

 tDenormalize: Concatenates different fields into an array or a delimited string.

 tNormalize: Normalizes a flat row to multiple rows.

 tRunJob: tRunJob belongs to two component families: System and Orchestration.
Function: tRunJob executes the Job called in the component's properties, within the frame of the defined context.
Purpose: tRunJob helps master complex Job systems which need to execute one Job after another.
Usage: This component can be used as a standalone Job, or can help clarify a complex Job by avoiding having too many subjobs
together in one Job.
If you want to create a reusable group of components to be inserted in several Jobs, or several times in the same Job, you can use a Joblet.
Unlike tRunJob, a Joblet uses the context variables of the Job in which it is inserted. For more information on Joblets, see the Talend Studio User Guide.
Note: This component also allows you to call a Job of a different framework, such as a Spark Batch Job or a Spark Streaming Job.
Connection Types

There are various types of connections, which define either the data to be processed, the data output, or the Job's logical sequence.

There are three ways to connect data:

1. Row
2. Trigger
3. Iterate

Row
A Row connection handles the actual data. Row connections can be Main, Lookup, Reject or Output according to the nature of the flow
processed.
- Main: the most commonly used type of row connection. It passes on data flows from one component to the other, iterating on each
row and reading input data according to the component's properties setting (schema).
- Lookup: connects a sub-flow component to a main flow component; used only in the case of
multiple input flows.
- Output: connects the flow to an output component.
Trigger

Trigger connections define the processing sequence, i.e. no data is handled through these connections.

Triggers
• Sub Job Triggers
• Run If Triggers
• On Component Triggers

Iterator

• The Iterate connection can be used to loop


on files contained in a directory, on rows
contained in a file or on DB entries.
• The Iterate link is mainly to be connected to
the start component of a flow (in a subjob).
Creation of MetaData

Files & XMLs
Delimited file
 Follow the steps below to import a delimited file.
Importing a Delimited File
 Follow the steps below to import a delimited file.
Importing a Regex File
 Follow the steps below to import a regex file.
Importing an Excel File
 Follow the steps below to import Excel file metadata.
tFileList

 Iterates on the files or folders of a given directory: retrieves a set of files or folders
based on a filemask pattern and iterates on each of them.
tFileTouch

 Either creates an empty file or, if the specified file already exists, updates its
date of modification and of last access while keeping the contents unchanged.
tWaitForFile

 Iterates on the specified directory and triggers the next component when the
defined condition is met.
 This component is used to put the component connected with it in waiting
state. It then triggers that component when the defined file operation occurs in
the specified directory.
tFileExist

 Checks whether a file exists, helping to streamline processes by automating this
recurrent and tedious check.
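The check tFileExist performs maps directly onto a one-line test in Java; the path below is hypothetical, and this is only an illustration of the decision the component exposes:

```java
import java.nio.file.*;

public class FileExistCheck {
    // Report whether a file exists: the decision tFileExist exposes
    // to the rest of the Job (e.g. via a Run If trigger).
    static boolean exists(String path) {
        return Files.exists(Paths.get(path));
    }

    public static void main(String[] args) {
        // Branch the flow depending on the check, as a Job would
        System.out.println(exists("/tmp/input.csv") ? "process file" : "skip subjob");
    }
}
```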
tFileInputFullRow
 Reads full rows in a delimited or fixed-width file.
 It allows the user to define their own schema.

 Output
tFileInputRaw
 Reads the file content as raw data.
 It does not allow the user to define their own schema.

 Output
Creating an Input XML Definition
 Follow the steps below to import an XML file.
Creating an Output XML Definition
tExtractXMLField

 Reads an input XML field of a file or a database table and extracts desired
data.
Data Processing Components

Introduction to various key components
• tMap: Allows joins, filtering, transformations and routes data from single or
multiple sources to single or multiple targets.
• tJoin: Performs inner or left outer join between the main data flow and a
lookup flow.
• tSortRow: Allows multi-column sorting on input data, providing advanced
sorting capabilities (asc/desc sorting, alphabetical sorting).
• tFilterRow: Filters input rows by setting conditions on the selected columns.
• tConvertType: Allows specific conversions at run time from one Talend java
type to another type.
• tAggregateRow: Receives a flow and aggregates it based on one or more
columns. For each output line, the aggregation key and the relevant results
of the set operations (min, max, sum...) are provided.
Introduction to various key components continue
• tUniqRow: Makes a data flow unique based on the key set on the schema.
• tUnite: Merges multiple inputs to single output.
• tReplicate: Creates multiple output sets from a single input set.
• tRowGenerator: Generates sample data.
• tDenormalize: Concatenates different fields into an array or a delimited string.
• tNormalize: Normalizes a flat row to multiple rows.
• tLogRow displays data or results in the Run console,it is used to monitor data processed.
• tParallelize: An Orchestration component (available in the Enterprise Edition) used to execute
multiple subjobs at the same time. This is achieved by multi-threading the jobs.
• tRunJob: tRunJob allows you to embed one Talend Job within another so that it
may be executed as a Talend subjob.
• tJava, tJavaRow, tJavaFlex
tReplicate
Duplicate the incoming schema into two identical output flows. Allows you to perform different operations on the
same schema.
Usage
This component is not startable (green background), it requires an Input component and an output component.

tSplitRow
tSplitRow splits one row into several rows. This component helps splitting one input row into several output rows.
Usage
This component splits one input row into multiple output rows by mapping input columns onto output columns.
tFlowToIterate
tFlowToIterate transforms a data flow into a list. Allows you to transform a processable flow into non processable data.
Usage
You cannot use this component as a start component. tFlowToIterate requires an output component.
tIterateToFlow
tIterateToFlow transforms a list into a data flow that can be processed. Allows you to transform non processable data
into a processable flow.
Usage
This component is not startable (green background) and it requires an output component.
tFilterRow
• tFilterRow filters input rows by setting one or more conditions on the selected columns. It helps parametrizing
filters on the source data. It has a single input link and output link as well as optionally it can have a reject link
which can be used to capture the unmatched data.

Sample Job:

• Output :
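A sketch of tFilterRow's routing in plain Java: rows matching the condition go to the main output, the rest to the reject link. The numeric condition and values are hypothetical, not taken from the sample Job:

```java
import java.util.*;

public class FilterWithReject {
    // Route each row to "main" when it matches the condition (value > threshold),
    // otherwise to "reject", mirroring tFilterRow's filter and reject links.
    static Map<String, List<Integer>> filter(List<Integer> in, int threshold) {
        Map<String, List<Integer>> flows = new LinkedHashMap<>();
        flows.put("main", new ArrayList<>());
        flows.put("reject", new ArrayList<>());
        for (int v : in) {
            flows.get(v > threshold ? "main" : "reject").add(v);
        }
        return flows;
    }

    public static void main(String[] args) {
        System.out.println(filter(List.of(5, 20, 8, 42), 10));
        // {main=[20, 42], reject=[5, 8]}
    }
}
```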
tFilterColumns
• tFilterColumns helps homogenize schemas by reordering columns, removing unwanted
columns or adding new ones.
• Makes specified changes to the defined schema, based on column-name mapping.
 Out Put
Sample Job:
tSortRow
 Function: Sorts input data based on one or several columns, by sort type and order.
 Purpose: Helps create metrics and classification tables.
 Usage: This component handles a flow of data, therefore it requires input and output components; it is
defined as an intermediary step.

 Output:
tExternalSortRow

 Function: Uses an external sort application to sort input data based on one or several columns,
by sort type and order.
 Purpose: Helps create metrics and classification tables.
 Usage: This component handles a flow of data, therefore it requires input and output components; it is
defined as an intermediary step.

 Output:
tAggregateRow

 Function: tAggregateRow receives a flow and aggregates it based on one
or more columns. For each output line, the aggregation key
and the relevant results of the set operations (min, max, sum...) are provided.
 Purpose: Helps provide a set of metrics based on values or calculations.

 Output:
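One of tAggregateRow's set operations, a sum grouped by key, can be sketched as follows (hypothetical two-column rows: group key, numeric value; not Talend-generated code):

```java
import java.util.*;

public class AggregateRows {
    // Group rows by the key column and sum the numeric column: one of the
    // set operations (min, max, sum, count...) tAggregateRow offers.
    static Map<String, Integer> sumBy(List<String[]> rows) {
        Map<String, Integer> out = new TreeMap<>();
        for (String[] r : rows) {
            out.merge(r[0], Integer.parseInt(r[1]), Integer::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[]{"east", "10"},
                new String[]{"west", "5"},
                new String[]{"east", "7"});
        System.out.println(sumBy(rows)); // {east=17, west=5}
    }
}
```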
tAggregateSortedRow

 Function: tAggregateSortedRow aggregates sorted input data for output columns based on a set of
operations.
 Purpose: Each output column is configured with as many rows as required, the operations to be carried out,
and the input column from which the data will be taken, for better data aggregation.
 Receives a sorted flow and aggregates it based on one or more columns. For each output line, the
aggregation key and the relevant results of the set operations (min, max, sum...) are provided.
 Input rows count: Specify the number of rows sent to the tAggregateSortedRow component.

Note: If you specified a limit for the number of rows to be processed in the input component, you must
use that same limit in the Input rows count field.
tAggregateSortedRow continue..
• tAggregateSortedRow helps provide a set of metrics based on values or calculations. As the input flow is
meant to be sorted already, performance is greatly optimized.
• Receives a sorted flow and aggregates it based on one or more columns. For each output line, the
aggregation key and the relevant results of the set operations (min, max, sum...) are provided.

Input row count is a mandatory field; it can be a static or dynamic value.

Output:
tAggregateSortedRow

 Receives a sorted flow and aggregates it based on one or more columns. For
each output line, the aggregation key and the relevant results of
set operations (min, max, sum, ...) are provided.

Input rows count is a mandatory field; its value can be static or dynamic.

 Output:
tNormalize
• Normalizes denormalized data.
• Function: Normalizes the input flow following SQL standards.
• Purpose: tNormalize helps improve data quality and thus eases data updates.

• How to split a multi-valued attribute column into individual rows using tNormalize?

 Input

 Output
Usage: This component can be used as an intermediate step in a data flow.
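The split tNormalize performs on a multi-valued column can be sketched in plain Java. Illustrative only; in the component you simply set the item separator and the column to normalize:

```java
public class NormalizeSketch {
    // Split one multi-valued field ("red,green,blue") into individual rows,
    // as tNormalize does with "," configured as the item separator.
    public static String[] normalize(String field, String separator) {
        return field.split(separator);
    }

    public static void main(String[] args) {
        for (String row : normalize("red,green,blue", ",")) {
            System.out.println(row);
        }
    }
}
```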
tDenormalize
• Function: Denormalizes the input flow based on one key column.
• Purpose: tDenormalize helps synthesize the input flow.
• Usage: This component can be used as an intermediate step in a data flow.
• How to combine multiple records into a single record using tDenormalize
Input

Output:
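The inverse operation, combining grouped values into one delimited field, can be sketched in plain Java. Illustrative only; tDenormalize does the grouping by key for you:

```java
public class DenormalizeSketch {
    // Join several values of a column into one field with a separator,
    // as tDenormalize does for the rows of one group.
    public static String denormalize(String[] values, String separator) {
        return String.join(separator, values);
    }

    public static void main(String[] args) {
        System.out.println(denormalize(new String[] {"red", "green", "blue"}, ","));
    }
}
```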
tDenormalizeSortedRow
 Combines all input sorted rows into groups; distinct values of the denormalized sorted row are joined with item separators.
 Function: tDenormalizeSortedRow combines all input sorted rows into groups. Distinct values of the denormalized sorted row are joined with
item separators.
 Purpose: tDenormalizeSortedRow helps synthesize the sorted input flow to save memory.
 Input row count: Enter the number of input rows.
 To denormalize: Enter the name of the column to denormalize.
 Usage: This component handles flows of data, so it requires input and output components.

Input row count is a mandatory field; its value can be static or dynamic.

Output:
tExtractDelimitedFields

 Function: tExtractDelimitedFields generates multiple columns from a given string column in a delimited file.
 Purpose: tExtractDelimitedFields helps to extract 'fields' from within a string, for example to write them elsewhere.
 Usage: This component handles the flow of data, so it requires input and output components. It allows you to
extract data from a delimited field using a Row > Main link, and enables you to create a reject flow that filters out data
whose type does not match the defined type.

 Output:
tUniqRow – Data quality component

 Function: Compares entries and filters out duplicate entries from the input flow.
 Purpose: Ensures the data quality of the input or output flow in a Job.
 Usage: This component handles the flow of data, so it requires both an input and an output; it is defined
as an intermediary step.

 Output:
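Keeping only the first occurrence of each row, as tUniqRow does on its unique output flow, can be sketched in plain Java (illustrative; tUniqRow also offers a separate duplicates flow):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class UniqRowSketch {
    // Keep the first occurrence of each row and drop later duplicates,
    // preserving the original input order.
    public static List<String> uniques(List<String> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }

    public static void main(String[] args) {
        System.out.println(uniques(List.of("a", "b", "a", "c")));
    }
}
```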
Merge component
tUnite

 Function: Merges data from various sources, based on a common schema.


 Warning: tUnite cannot exist in a data flow loop. For instance, if a data flow goes through several tMap components to
generate two flows, they cannot both be fed to the same tUnite.
 Purpose: Centralizes data from various heterogeneous sources. It keeps duplicates, like UNION ALL in Oracle.
 Usage: This component is not startable; it requires one or several input components and an output component.

 Output:
tConvertType

• tConvertType allows specific conversions at runtime from one Talend Java type to another.

• It helps to automatically convert one Talend Java type to another and thus avoid compile errors.

• Output:
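A runtime String-to-Integer conversion like the one tConvertType performs can be sketched in plain Java. Illustrative only; here an invalid value becomes null, whereas tConvertType can route such rows to a reject flow instead:

```java
public class ConvertTypeSketch {
    // Convert a String field to an Integer at runtime; non-numeric
    // values return null rather than raising a compile-time error.
    public static Integer toInteger(String value) {
        try {
            return Integer.valueOf(value.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toInteger(" 42 "));
        System.out.println(toInteger("oops"));
    }
}
```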
tWaitForFile

 Function: tWaitForFile iterates on the specified directory and triggers the next component when the defined condition is met.

 Purpose: This component is used to put the component connected with it in waiting state. It then triggers that component
when the defined file operation occurs in the specified directory.

 Usage : This component plays the role of triggering the next component based on the defined condition. Therefore this
component requires another component to be connected to it via a link.
tReplace
 Searches for a character, pattern, or word and replaces it with the given replacement.
 Function: Carries out a search-and-replace operation on the defined input columns.
 Purpose: Helps to cleanse all files before further processing.
 Usage: This component is not startable, as it requires an input flow, and it requires an output component.

 Input

 Output:
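The cleansing tReplace performs can be sketched in plain Java. Illustrative only; with regex matching enabled, the tReplace pattern is a Java regular expression like the one below:

```java
public class ReplaceSketch {
    // Strip every character that is not a letter, digit, or space,
    // a typical cleansing rule before further processing.
    public static String cleanse(String value) {
        return value.replaceAll("[^A-Za-z0-9 ]", "");
    }

    public static void main(String[] args) {
        System.out.println(cleanse("Jo#hn D!oe"));
    }
}
```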
tReplaceList -- Data Quality component

 Searches for a character, pattern, or word and replaces it with the given replacement.
 Function: Carries out a Search and Replace operation in the input columns defined based on an external lookup.
 Purpose: Helps to cleanse all files before further processing.
 Usage: tReplaceList is an intermediary component. It requires an input flow and an output component.
tReplicate

 Function: Duplicates the incoming flow into two identical output flows.

 Purpose: This component allows you to perform different operations on the same schema.
 tReplicate belongs to two component families: Processing and Orchestration.
 Usage: This component is not startable (green background); it requires an input component and an output
component.
 Output:
tSchemaComplianceCheck -- Data quality family.

 Function: Validates all input rows against a reference schema, or checks types, nullability, and length of rows against reference
values. The validation can be carried out in full or in part.
 Purpose: Helps to ensure the data quality of any source data against a reference data source.
 Usage: This component is an intermediary step in the flow, allowing you to exclude non-compliant data from the main flow.
It cannot be a start component, as it requires an input flow. It also requires at least one output component to
gather the validated flow, and possibly a second output component for rejected data, using a Reject link.
tSampleRow

 Function: tSampleRow filters rows according to line numbers.

 Purpose: tSampleRow helps to select rows according to a list of single lines and/or a list of groups of lines.
 tSampleRow takes an input, selects a range of rows from it, and returns an output.

 Output:
tJoin
 Joins two tables by doing an exact match on several columns. It compares
columns from the main flow with reference columns from the lookup flow and
outputs the main flow data and/or the rejected data.
Left Join
Note: Include lookup columns in output: this option must be selected in the properties
in order to fetch fields from the reference link.
 Note: To perform a right join, change the main link to
reference and the reference link to main.
 Input  Reference

• Output
Inner Join
 To perform an inner join using tJoin, select the Inner join option in the tJoin properties.
1. Main: inner-joined records.
2. Inner join reject: rejected records from the Main link (source data).

Note: Include lookup columns in output: this option must be selected
in the properties in order to fetch fields from the reference link.
Inner Join [continue..]
Sample Job
 Input

 Reference

 Inner join

 Inner Join Reject


tMap
 The core functionality of tMap is transforming input data into output data.
tMap

 tMap allows you to specify a temporary (buffer) path for internal calculations.

 Each input allows you to specify an Expression and a Filter.

 The Expression and Filter options are optional in tMap.


tMap LookupModel [contd…]

1. Load once : loads all the records from the lookup flow once (and only once), either into
memory or into a local file (when the Store temp data option is set to true), before processing
each record of the main flow.

2. Reload at each row : loads all the records of the lookup flow for each record of the main flow.
Use this when the lookup data flow is constantly updated and you want to load the latest lookup
data for each record of the main flow, to get the latest data after the join execution.

3. Reload at each row (cache) : all the records of the lookup flow are loaded for each record of
the main flow. The lookup data are cached in memory, and when a new loading occurs, only the
records that do not already exist in the cache are loaded.
tMap Match Model [Contd…]

 Matches rows between the main flow and the lookup flow in the following ways:

1. All rows

2. Unique match

3. First match

4. All matches
tMap Join Model [Contd…]

 Performs the following types of joins :

1. Inner join

2. Left Outer join


tMap Store temp data [Contd…]

 Store temp data = True: allows you to provide a temporary path for
internal calculations.

 Store temp data = False: uses the default cache memory for internal
calculations.
Data Operation

 CHAR: Converts a numeric value to its ASCII character string equivalent.

 DTX: Converts a decimal integer into its hexadecimal equivalent.

 FIX:
Rounds a number of type Double to a number of type Long with the precision specified in
the PRECISION statement.

 XTD: Converts a hexadecimal string into its decimal equivalent.
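DTX and XTD correspond to standard decimal/hexadecimal conversions. An equivalent plain-Java sketch (method names here are illustrative, not the Talend routine signatures):

```java
public class HexSketch {
    // Decimal -> hexadecimal, like the DTX routine.
    public static String dtx(int decimal) {
        return Integer.toHexString(decimal).toUpperCase();
    }

    // Hexadecimal -> decimal, like the XTD routine.
    public static int xtd(String hex) {
        return Integer.parseInt(hex, 16);
    }

    public static void main(String[] args) {
        System.out.println(dtx(255));
        System.out.println(xtd("FF"));
    }
}
```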


Data Operation [contd…]

 Input:  output:
Mathematical

 ABS : Returns the absolute (positive) numeric value of an expression.

 FFIX : Converts a floating-point number to a string with a fixed precision.

 BITAND : Performs a bitwise AND of two integers.

 MOD : Calculates the modulo (the remainder) of two expressions.

 SADD : Adds two string numbers and returns the result as a string number.

 SCMP : Compares two string numbers
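Several of these routines map onto standard Java operators. A plain-Java sketch of the underlying operations (not the Talend routine calls themselves):

```java
public class MathRoutinesSketch {
    public static void main(String[] args) {
        System.out.println(Math.abs(-7)); // ABS: absolute value
        System.out.println(6 & 3);        // BITAND: bitwise AND
        System.out.println(10 % 3);       // MOD: remainder of a division
    }
}
```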


Mathematical[Contd..]

 Input:  Output :
Numeric

 convertImpliedDecimalFormat : returns numbers using an implied decimal format

 random : returns a random integer between min and max

 sequence : returns an incremented numeric id

 input : Output :
Relational

 IsNull : Indicates whether a variable is null (returns true if the incoming value is null,
false otherwise).
 NOT : Returns the complement of the logical value of an expression.

 Input  Output
String Handling

 BTRIM : Deletes all blank spaces and tabs after the last nonblank
character in an expression.

 COUNT : Evaluates the number of times a substring is repeated in a string.

 DOWNCASE : Converts all uppercase letters in an expression to lowercase.

 EREPLACE : Substitutes all substrings that match the given regular
expression in the given old string with the given replacement and returns a
new string.

 FTRIM : Deletes all blank spaces and tabs up to the first nonblank
character in an expression.

 INDEX : Returns the starting column position of a specified occurrence of a
particular substring within a string expression.
String Handling[Contd..]

 IS_ALPHA : Determines whether the expression is alphabetic or nonalphabetic.

 LEFT : Specifies a substring consisting of the first n characters of a string.

 LEN : Calculates the length of a string.

 RIGHT : Specifies a substring consisting of the last n characters of a string.

 SQUOTE : Encloses an expression in single quotation marks.

 STR : Generates a particular character string a specified number of times

 TRIM : Deletes extra blank spaces and tabs from a character string

 UPCASE : Converts all lowercase letters in an expression to uppercase.
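Most of these routines correspond to standard java.lang.String methods. A plain-Java sketch of the equivalents (not the Talend StringHandling calls themselves):

```java
public class StringRoutinesSketch {
    public static void main(String[] args) {
        System.out.println("  Talend  ".trim());      // TRIM: strip extra spaces
        System.out.println("data".toUpperCase());     // UPCASE
        System.out.println("Data".toLowerCase());     // DOWNCASE
        System.out.println("Talend".length());        // LEN
        System.out.println("Talend".substring(0, 3)); // LEFT with n = 3
        System.out.println("abcabc".indexOf("b"));    // INDEX (0-based in Java)
    }
}
```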


String Handling[Contd..]
String Handling[Contd..]

 Input

 Output :
TalendDataGenerator
 getFirstName() : Generates a random first name

 getLastName() : Generates a random last name

 getUsCity() : Generates a random US city

 getUsState() : Generates a random US state

 getUsStateId() : Generates a random US state ID

 getUsStreet() : Generates a random US street


TalendDataGenerator [Contd..]

 Output :
TalendDate

 addDate : adds a number of days, months, etc. to a date (with the date given as a String
with a pattern)

 compareDate : compares two dates

 diffDate : returns the difference between two dates

 formatDate : formats a Date into a date/time string

 getCurrentDate : returns the current date

 getDate : returns the current datetime in the given display format

 getFirstDayOfMonth : gets the first day of the month

 getLastDayOfMonth : gets the last day of the month

 getPartOfDate : gets part of the date, such as
year, month, hour, day_of_week, week_of_month, week_of_year, and so on
TalendDate [contd…]

 getRandomDate : returns an ISO-formatted random date

 isDate : tests whether a string value is a date (with the right pattern)

 setDate : partly sets a date to a new value
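TalendDate wraps standard Java date handling; formatting and date arithmetic can be sketched with java.text.SimpleDateFormat and java.util.Calendar. These are plain-Java equivalents, not the TalendDate signatures:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

public class DateSketch {
    // Format a Date into a string, like TalendDate.formatDate.
    public static String format(Date d, String pattern) {
        return new SimpleDateFormat(pattern).format(d);
    }

    // Add days to a date, like TalendDate.addDate with the day part.
    public static Date addDays(Date d, int days) {
        Calendar c = Calendar.getInstance();
        c.setTime(d);
        c.add(Calendar.DAY_OF_MONTH, days);
        return c.getTime();
    }

    public static void main(String[] args) {
        Calendar c = Calendar.getInstance();
        c.clear();
        c.set(2016, Calendar.JANUARY, 31);
        // Adding one day rolls over into February.
        System.out.println(format(addDays(c.getTime(), 1), "yyyy-MM-dd"));
    }
}
```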


TalendDate[Contd..]

 Input:

 Output :
TalendString

 getAsciiRandomString : Returns a randomly generated ASCII string.

 talendTrim : Returns a copy of the string, with the leading and trailing specified
char omitted.
TalendString[Contd..]

 Output
Joins using tMap

 By using tMap we can perform the join types below.

1. Inner Join (with Reject)

2. Left Outer Join


Inner Join using tMap
 Performing inner join in the tMap component.

 Sample Job:
Inner Join using tMap
 tMap settings:

 Output
Left outer Join using tMap
 Sample:

 Output
Full Outer Join using tMap

 Follow the steps below to perform a full outer join.

1. Perform the inner join and collect the reject outputs.

2. Perform the left join and collect the output.

3. Combine both outputs from the steps above.
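Those steps amount to the following plain-Java sketch of a full outer join on a key. Illustrative only; in the Job, the combination is done with Talend components rather than code like this:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FullOuterJoinSketch {
    // Full outer join of two key->value maps: the left-join output plus
    // the lookup-side rows that found no match (the inner-join rejects).
    public static List<String> fullOuterJoin(Map<String, String> left,
                                             Map<String, String> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : left.entrySet()) {
            // Left outer join: an unmatched right side becomes null.
            out.add(e.getKey() + "," + e.getValue() + "," + right.get(e.getKey()));
        }
        for (Map.Entry<String, String> e : right.entrySet()) {
            if (!left.containsKey(e.getKey())) {
                // Rows present only in the lookup flow.
                out.add(e.getKey() + ",null," + e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> l = new LinkedHashMap<>();
        l.put("1", "Alice");
        l.put("2", "Bob");
        Map<String, String> r = new LinkedHashMap<>();
        r.put("2", "Sales");
        r.put("3", "HR");
        fullOuterJoin(l, r).forEach(System.out::println);
    }
}
```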


Full Outer Join using tMap [Contd..]

 Output
tMap

• The core functionality of tMap is transforming input data into output data.

• Advanced component, which integrates itself as a plugin to Talend Studio.

• Transforms and routes data from single or multiple sources to single or multiple destinations.
• tMap can be used for joins and exposes Model properties. There are three models:

1. Join Model
2. Match Model
3. Lookup Model

Join Model
In tMap we can perform the joins below. The default Join Model is Left Outer Join.
• Left outer join
• Inner Join
Match Model

The default Match Model is the curiously named Unique match: if your primary row matches multiple rows in your lookup input, then only
the last matching row will be output. The remaining options are First match, where only the first matching row will be output, and All matches,
where all matching rows will be output.
tMap [contd…]
LookupModel

1. Load once : loads all the records from the lookup flow once (and only once), either into memory or into a local file (when the
Store temp data option is set to true), before processing each record of the main flow.

2. Reload at each row : loads all the records of the lookup flow for each record of the main flow. Use this when the lookup data flow is
constantly updated and you want to load the latest lookup data for each record of the main flow, to get the latest data after the join execution.

3. Reload at each row (cache) : all the records of the lookup flow are loaded for each record of the main flow. The lookup data are cached
in memory, and when a new loading occurs, only the records that do not already exist in the cache are loaded.
tMap [contd…]
• Each input allows you to specify an Expression and a Filter.

 The Expression and Filter options are optional in tMap.

 tMap allows you to specify a temporary (buffer) path for internal calculations.

• Store temp data = True: allows you to provide a temporary path for internal calculations.

• Store temp data = False: uses the default cache memory for internal calculations.
Joining Data Source by using tMap
Joins using tMap
Joins using tMap  Sample Job:

• By using tMap we can perform the join types below.

1. Inner Join (with Reject)

2. Left Outer Join

Performing an inner join in the tMap component.


Inner Join using tMap
 tMap settings:

 Output
Left outer Join using tMap
 Sample:

 Output
Full Outer Join using tMap

• Follow the steps below to perform a full outer join.

1. Perform the inner join and collect the reject outputs.

2. Perform the left join and collect the output.

3. Combine both outputs from the steps above.

• Output
tMap Example

Match Model All Matches

Match Model Unique

Match Model First Row

Reject_Output
tJoin

Component Tab of tJoin

Joins two tables by doing an exact match on several columns. It compares
columns from the main flow with reference columns from the lookup flow and
outputs the main flow data and/or the rejected data.

Difference between tMap and tJoin


tLoop

 tLoop iterates on a task execution.

 tLoop allows you to automatically execute a task or a Job based on a loop.

 tLoop supports two loop types:

 1. For loop

 2. While loop

1. For Loop:

 Output:
tLoop [Contd…]
2. While Loop:

OutPut:
tForEach
 tForEach creates a loop over a list for an Iterate link.

 Output:
Useful Components in Talend

tRunJob launches another Talend Job stored in the project.

tParallelize helps you manage complex Job systems. It executes several subjobs simultaneously and synchronizes the
execution of a subjob with the other subjobs within the main Job.

tSendMail sends emails and attachments to defined recipients.


Parent and Child Job (tParallelize & tRun)

tParallelize: The tParallelize component is an orchestration component. tParallelize (available in the Enterprise
Edition) is used to execute multiple Jobs at the same time, achieved by multi-threading the Jobs.

tRun: This component allows you to embed one Talend Job within another so that it may be executed as a Talend subjob.
It executes the Job called in the component's properties, in the frame of the defined context.
tRun helps master complex Job systems which need to execute one Job after another. By this means it can be used
to establish a parent-child relationship among Jobs.
 You can add a tRunJob component from the Component Palette (Orchestration), or drag an existing Job from Job Designs in
the Repository browser.
Usage
This component can be used as a standalone Job, or it can help clarify a complex Job by avoiding having too many subjobs
all together in one Job.
Example job to describe tParallelize and tRun
Running jobs in parallel mode
• In order to run subjobs in parallel, select the Multi thread
execution option in the Job view.

 Note: By default, subjobs are executed sequentially.


Dynamic schema

• Dynamic schemas allow you to design Jobs with an unknown column structure
(unknown name and number of columns).

• The main application of this functionality is a replication scenario or a
simple one-to-one mapping of many columns.

• For example, if you need to migrate a whole database with hundreds of tables,
you can do so without explicitly including each table structure, using a single Job.
Dynamic schema supported components
tFileInputDelimited tParAccelInput tAggregateRow
tFileOutputDelimited tSQLiteInput tSortRow
tAccessInput tSasInput tFilterRow
tAS400Input tSybaseInput tWriteDynamicFields
tDBInput tVectorWiseInput tExtractDynamicFields
tDB2Input tVerticaInput tUnite
tEXAInput tVerticaOutput tUniqRow
tFirebirdInput tTeradataInput tRunJob
tGreenPlumInput tJava tReplicate
tHSQLDBInput tJavaFlex tAggregateSortedRow
tIngresInput tJavaRow tFilterColumns
tInformixInput tLogRow tJoin
tJavaDBInput tMap tSampleRow
tJDBCInput tOracleOutput tHashInput
tMaxDBInput tMysqlOutput tHashOutput
tMysqlInput tMSSqlOutput tFileInputPositional
tMSSqlInput tPostgresqlOutput tFileOutputPositional
tNetezzaInput tAS400Output tAmazonMysqlInput
tOracleInput tDB2Output tAmazonMysqlOutput
tPostgresqlInput tInformixOutput tAmazonOracleInput
tPostgresPlusInput tSybaseOutput tAmazonOracleOutput
tTeradataOutput tSAPHanaInput
tSAPHanaOutput
tLDAPInput
Custom Code Components

Custom Code Components

 tJava

 tJavaRow

 tJavaflex

 tLibraryLoad

 tSetGlobalVar

 tSetDynamicSchema (Enterprise)
tJava

 tJava allows you to write custom Java code.

 It applies exclusively to the start of the generated code of the subjob.

 It will be executed first, but only once, in the subjob.

 tJava has no input or output data flow and is used as a separate subjob.

 Output
tJavaRow
The tJavaRow component allows Java logic to be performed for every record within a flow.
Because tJavaRow is called for every row processed, it is possible to create a global variable for a row that can be referenced by all
components in a flow.

The tJavaRow code applies exclusively to the main part of the generated code of the subjob.
The Java code inserted through tJavaRow will be executed for each row. Generally,
the tJavaRow component is used as an intermediate component, and you are able to access the
input flow and transform the data.
The following use case shows a typical Job using a tJavaRow:
 A tFileInputDelimited component reads data from a text file,

 a tJavaRow component applies some transformation to the data being processed,

 then the transformed data is displayed to the console using a tLogRow component.
tJavaFlex

 tJavaFlex allows you to enter personalized code in order to integrate it into the Talend
program.

 tJavaFlex has three parts (Start code, Main code, End code):

1. Start code: initialization.

2. Main code: applied to each row of the data flow.

3. End code: closing loops and other cleanup.

 The start part will be executed at the beginning of the subjob, but only once.

 The main part will be executed for each row.

 The main part allows you to access the input flow and modify the data.

 The source data is processed at runtime by tJavaFlex.

 The end part will be executed at the end of the subjob, but only once.
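The three parts map onto the generated code roughly as in this sketch (illustrative, not actual Talend-generated code):

```java
public class JavaFlexSketch {
    public static int sumRows(int[] rows) {
        // Start code: runs once, before the first row (initialization).
        int total = 0;

        // Main code: runs once per row of the flow.
        for (int row : rows) {
            total += row;
        }

        // End code: runs once, after the last row (cleanup / summary).
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumRows(new int[] {1, 2, 3}));
    }
}
```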
tJavaFlex - Sample Job

• Enter the Java code needed to import external libraries used in the Main code.

 Output:
Difference between tjava,tjavaRow and tjavaFlex
External JARs -- tLibraryLoad
tLibraryLoad allows you to load usable Java libraries in a Job.

Alternatively, go to Window > Preferences.
Files & XMLs
Delimited file
 Follow the steps below to import a delimited file.
Importing a Delimited File
 Follow the steps below to import a delimited file.
Importing a Regex File
 Follow the steps below to import a regex file.
Importing an Excel File
 Follow the steps below to import Excel file metadata.
tFileList

 Iterates on files or folders of a set directory. Retrieves a set of files or folders
based on a filemask pattern and iterates on each unit.
tFileTouch

 Either creates an empty file or, if the specified file already exists, updates its
date of modification and of last access while keeping the contents unchanged.
tWaitForFile

 Iterates on the specified directory and triggers the next component when the
defined condition is met.
 This component is used to put the component connected with it in waiting
state. It then triggers that component when the defined file operation occurs in
the specified directory.
tFileExist

 Checks whether a file exists or not, and helps to streamline processes by automating
recurrent and tedious tasks such as checking if a file exists.
tFileInputFullRow
 Reads full rows in a delimited or fixed-width file.
 It allows the user to create their own schema.

 Output
tFileInputRaw
 Reads full rows in a delimited or fixed-width file.
 It does not allow the user to create their own schema.

 Output
Creating an Input XML Definition
 Follow the steps below to import an XML file.
Creating an Output XML Definition
tExtractXMLField

 Reads an input XML field of a file or a database table and extracts the desired
data.
CONTEXTS & VARIABLES

CONTEXTS & VARIABLES

 Variables represent values which change throughout the execution of a
program. A context is characterized by parameters.
Context/Global Context
What is Context?
Context describes the user-defined parameters that are passed to your Job at runtime. Using context
variables enables the application to be migrated across environments without making any change to the Talend
code; the only thing that needs to change across environments is the value of each context variable.
Values may also change as your environment changes; for example, passwords may change from time to time.

Types of Context
• Context
• Global Context
The Global Map
• put
• get
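In Talend-generated code, the global map is a java.util.Map<String, Object>, so values must be cast back to their type on retrieval. A minimal plain-Java sketch of the put/get pattern (illustrative; in a Job you would use the globalMap variable Talend provides):

```java
import java.util.HashMap;
import java.util.Map;

public class GlobalMapSketch {
    // Talend exposes one shared map like this per Job run as "globalMap".
    static Map<String, Object> globalMap = new HashMap<>();

    public static void main(String[] args) {
        globalMap.put("rowCount", 42);                   // store a value
        int count = (Integer) globalMap.get("rowCount"); // cast back on get
        System.out.println(count);
    }
}
```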
Context Parameters & Variables
 Context variables allow jobs to be executed in different ways, with different parameters

 Variables represent values which change throughout the execution of a program. A context is characterized by
parameters.

 You can define context variables for a particular Job in two ways:

 Using the Contexts view of the Job.

 Using the F5 key from the Component view of a component.

How to define context variables in the Contexts view?

 The Contexts view is positioned among the configuration tabs below the design workspace.
 If you cannot find the Contexts view on the tab system of Talend Studio, go to Window > Show view > Talend, and select Contexts.

 The Contexts tab view shows all of the variables that have been defined for each component in the current Job and context variables
imported into the current Job.
Context Parameters (Contd..)
Context Parameters (Contd..)
 From this view, you can manage your built-in variables:
 Create and manage built-in contexts.

 Create, edit and delete built-in variables.

 Reorganize the context variables.

 Add built-in context variables to the Repository.

 Import variables from a Repository context source for use in the current Job.

 Edit Repository-stored context variables and update the changes to the Repository.

 Remove imported Repository variables from the current Job.


Defining Variables
1. Click the [+] button at the bottom of the Contexts tab view to add a parameter line in the table.

2. Click in the Name field and enter the name of the variable you are creating, host in this example.

3. From the Type list, select the type of the variable corresponding to the component field where it will be used, String for
the variable host in this example.

4. For different variable types, the Value field appears slightly different when you click in it, and functions differently:
Defining Contexts from Contexts View

 Contexts View > Click ‘+’ available on right end > In the ‘Configure
Contexts’ popup, click on ‘New’ > enter the desired name
Defining Variables from Contexts View

 Contexts View > Click ‘+’ available at the left bottom > Fill in the fields
Name, Datatype, Value for all the Contexts as required.
Using Variables in a Job

 In the relevant Component view, place your mouse in the field you want to

parameterize and press Ctrl+Space to display a full list of all the global
variables and those context variables defined in or applied to your Job
Defining Context Variables from Component View

 On the relevant Component view, place your cursor in the field you want to

parameterize > Press F5 to display the [New Context Parameter] dialog box > Fill
the details as required
Centralizing Context Variables in the Repository
 Right-click the Contexts node in the Repository tree view and select
 Create context group from the contextual menu
Adding a built-in Context Variable to the Repository

 In the Context tab view of a Job, right-click the context variable you want to
add to the Repository and select Add to repository context from the contextual
menu to open the [Repository Content] dialog box.
Applying Repository Context Variables to a Job

 Double-click the Job to which a context group is to be added, once the Job is
opened, drop the context group of your choice either onto the Job workspace
or onto the Contexts view beneath the workspace
Running Job in a Selected Context

 Click the Run Job tab, and in the Context area, select the relevant context among
the various ones you created
Different Ways to achieve context
Via Talend Context

Via Property File

Via Put and get


SLOWLY CHANGING DIMENSIONS (SCD)

An incremental data-loading concept used in data integration projects that
involve the regular extraction and transportation of a large amount of
data from one system to another system or systems.
tMysqlSCD
tMysqlSCD [cont’d…]
Windows/Unix commands in Talend
tSystem

 tSystem executes one or more system commands.


 to console: data is passed on to be viewed in the Run view.

 Output:
NESTED JOBS
Running jobs within jobs, passing values from parent to child jobs, etc.
Introduction to Nested Jobs

■ Talend Studio allows us to call a Job as part of another Job, as well as
to pass values from one Job to another.
■ The tRunJob, tBufferInput, and tBufferOutput components can be used to
achieve this functionality.
■ In addition, context parameters can also be used to pass
values across Jobs.
■ Demo: Let's create three Jobs, viz. Child, Parent, Final. The intent is that Parent
should trigger both Child and Final from within itself, while passing
values across Child to Parent to Final.
Nested Jobs (Demo)

■ Create a Job-1 called Child as shown below


■ EMP as Source (tMysqlInput) > Reduce Columns (tFilterColumns) > Filter Rows
(tFilterRow) > Console View (tLogRow) > Output (tBufferOutput)
Nested Jobs (Demo) [cont’d…]

■ Create a Job-3 called Final as shown below

■ Pre-requisite: Create 3 context variables viz. vUID, vName, vSal

■ Generate incoming data rows (tRowGenerator) > Derive annual


salary (tMap) > Console View (tLogRow)
Nested Jobs (Demo) [cont’d…]

■ Create a Job-2 called Parent as shown below


■ Call Child Job (tRunJob) > Deriving UID (tMap) > Console View (tLogRow)
> Call Final Job (tRunJob)
Nested Jobs (Demo) [cont’d…]

■ Running the Parent Job

■ The tLogRow components placed at each Job helps us now to preview the
data transformation flow
Logs & Errors components

The Logs & Errors family groups the components dedicated to log information
catching and Job error handling.
List & most used components in Logs & Errors

 tChronometerStart & tChronometerStop

 tLogCatcher

 tDie

 tWarn

 tStatCatcher

 tLogRow
tChronometerStart

Starts measuring the time one or more subjobs take to execute.

tChronometerStop

• Measures the time one or more subjobs take to execute.

Caption: it can be empty or it can be parameterized.


Sample job:

Sample Output:
tLogCatcher

 Fetches a set of fields and messages from a Java exception, tDie, or tWarn,
and passes them on to the next component.

 Operates as a log function triggered by a Java exception, tDie, or tWarn,
to collect and transfer log data.
tDie

 tDie kills the current Job.

 tDie is often used with a tLogCatcher to capture the log before killing the Job.

• Die message: Enter the message to be displayed before the Job is killed.
• Error code: Enter the error code if needed.
• Priority: Set the level of priority.
• Priority can be one of: trace, debug, info, warning, error, fatal.
Sample job:

Sample Output:
tWarn

 Provides a priority-rated message to the next component.

 tWarn triggers a warning, often caught by the tLogCatcher component, for
exhaustive logging.
 Note: Cannot be used as a start component.

• Warn message: Enter the message to be displayed.
• Code: Enter the error code if needed.
• Priority: Set the level of priority.
Sample job:

Sample Output:
tStatCatcher

 tStatCatcher gathers Job-processing metadata at the Job level as well as
at each component level.
 It operates as a log function triggered by the tStatCatcher Statistics
check box of individual components, and collects and transfers this log
data to the defined output.

Sample job:

NOTE: The tStatCatcher Statistics option is available in all components.
Output:
tLogRow

 Displays data or results in the Run console.


Job Version Control
Job Version Control [cont…]

 Developers can change the Job version in the Studio using the minor and
major buttons; otherwise the Job stays at version 0.1.
 By default the Job version will be 0.1.
 Talend allows you to maintain any number of versions.
Creating a new Job "Test_Job"; by default its version will be 0.1.

M – indicates the major version.

m – indicates the minor version.
Existing job with new version 1.0

 To create a new version, right-click the Job, select Edit properties,
increase the version, and click Finish.
Follow the steps below to open an older version of Test_Job_1.0:

1. Right-click the Job and select Open another version.
2. Select the old version to open it.
Talend - Introduction
 Talend is the leading open source integration software provider to data-driven enterprises.

 Its modern data platform and open source approach simplify
the development process,

reduce the learning curve,

and decrease the total cost of ownership

to connect more data.

 Talend connects at big data scale, 5x faster and at 1/5th the cost.
Scope
Talend addresses all of an organization's data integration needs:
• Synchronization or replication of databases
• Right-time or batch exchanges of data
• ETL (Extract, Transform, Load) for BI or analytics
• Data migration
• Complex data transformation and loading
• Basic data quality
• Big Data
History
• Talend was founded in 2005 by Bertrand Diard and Fabrice Bonan, and is headquartered in Redwood City, California.
• The company's first product, Talend Open Studio for Data Integration, was launched in October 2006.
• Talend currently has more than 4000 paying customers, including eBay, Sony Online Entertainment and Disney.
• The current version available is 6.2.0.
• More than 900 built-in connectors allow you to easily link a wide array of sources and targets.
Talend - Architecture

The operating principles of the Talend products can be summarized briefly as the following topics:
• building technical or business-related processes,
• administrating users, projects, access rights, processes and their dependencies,
• deploying and executing technical processes,
• monitoring the execution of technical processes.
Each of the above topics can be isolated in a functional block; the different types of blocks and their interoperability are described in the following architecture diagram.
Talend – Architecture (Contd..)
https://community.talend.com/
Prerequisites for TOS for DI
• Operating system: MS Windows, Linux Ubuntu, or Red Hat Linux
• Memory usage: 3 GB minimum, 4 GB recommended
• Disk usage: 3 GB of disk space required for installation, 3+ GB required for use
• Environment: JDK 1.8, with JAVA_HOME set

Download TOS for DI from the link below:
https://www.talend.com/download/talend-open-studio/#t4
Software
• Download and extract the TOS DI v7.1.1 archive and use it.
• Download and install postgresql-9.4.0-1-windows-x64.
Configuring Talend
Downloading Talend
1. Download the Talend Open Studio zip file from the Talend website (http://www.talend.com/). Talend may be installed on Windows, OS X, Unix and Linux.
2. Unzip the downloaded file and, in the extracted folder, run the executable file corresponding to your operating system.
3. The startup window below appears.

Initial Setup - Create Project
• Step 1: Open the Talend application and click Create a new project.
• Step 2: Enter the details of the project to be created and click Finish.
• Step 3: The newly created project is added to the list of projects. Select the project and click Finish to access it.
Note: You can also import the Demo project to access the Talend-provided demo projects.
Talend Data Integration Studio

1. Repository: all the metadata is accessible in this section, which includes jobs, data object definitions, connection objects, routines, etc.
2. Design workspace: job design and development is done here.
3. Configuration tabs (properties): configuration of components, jobs and context variables can be done here.
4. Palette: all components and connectors can be browsed and selected from here.
Job Designer
Jobs are defined with a graphical tool (drag & drop components: DB tables, files - Excel, CSV, XML), which then generates the code (Perl or Java) that you can reuse in your program.
SLOWLY CHANGING DIMENSIONS (SCD)

• An incremental data-loading concept used in data integration projects that involve the regular extraction and transportation of a large amount of data from one system to another system or systems.
tMysqlSCD
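The tMysqlSCD slides above are screenshot-based, so as an illustration only (not Talend's generated code), here is a minimal SCD Type 2 sketch in Java, the language Talend generates. All class and field names are hypothetical: when a tracked attribute changes, the current dimension row is closed and a new row with an incremented version is appended.

```java
import java.util.List;

// Minimal SCD Type 2 sketch: on a change to a tracked attribute, the
// active dimension row is closed (active = false) and a new row with
// an incremented version is appended. Field names are illustrative.
public class ScdType2 {
    public static class DimRow {
        public final String key;    // business key
        public final String city;   // tracked attribute
        public final int version;
        public boolean active;
        public DimRow(String key, String city, int version) {
            this.key = key;
            this.city = city;
            this.version = version;
            this.active = true;
        }
    }

    // Apply one incoming record to the dimension table.
    public static void apply(List<DimRow> dim, String key, String city) {
        for (DimRow row : dim) {
            if (row.active && row.key.equals(key)) {
                if (row.city.equals(city)) {
                    return;              // no change: nothing to do
                }
                row.active = false;      // close the current version
                dim.add(new DimRow(key, city, row.version + 1));
                return;
            }
        }
        dim.add(new DimRow(key, city, 1)); // brand-new business key
    }
}
```

In a real job, tMysqlSCD performs this comparison against the MySQL dimension table itself and also maintains start/end dates for each version; the sketch keeps only a version number to show the core idea.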
Windows/Unix commands in Talend
tSystem

• tSystem executes one or more system commands.
• "to console": the command's output data is passed on to be viewed in the Run view.

Output:
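As an illustration only (not Talend's generated code), a component like tSystem essentially launches a process and captures its output for the next component or the Run view. A hedged sketch using the standard Java ProcessBuilder API:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch of what a system-command component does: run a command,
// capture its standard output (with stderr merged in), and return
// the text so a downstream component or console can consume it.
public class RunCommand {
    public static String run(String... command) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);          // merge stderr into stdout
        Process p = pb.start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();                           // wait for the command to finish
        return out.toString();
    }
}
```

For example, `run("ls", "-l")` on a Unix-like system returns the directory listing that tSystem would show in the Run view.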
Logs & Errors components

The Logs & Errors family groups the components dedicated to log information catching and job error handling.

Most used components in Logs & Errors:
• tChronometerStart & tChronometerStop
• tLogCatcher
• tDie
• tWarn
• tStatCatcher
• tLogRow
tChronometerStart

• Starts measuring the time a subjob (or subjobs) takes to be executed.

tChronometerStop

• Stops measuring the time a subjob (or subjobs) takes to be executed and displays it.
• Caption: can be empty or parameterized.

Sample job:

Sample Output:
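The chronometer pair above can be illustrated outside Talend. A minimal sketch (illustrative only, not the components' actual implementation) using Java's System.nanoTime, with a caption printed the way tChronometerStop reports elapsed time:

```java
// Sketch of chronometer-style timing: record a start instant when the
// subjob begins, then report the elapsed time with a caption when it ends.
public class Chronometer {
    private long startNanos;

    public void start() {
        startNanos = System.nanoTime();        // tChronometerStart analogue
    }

    // tChronometerStop analogue: prints "<caption>: <elapsed> ms"
    // and returns the elapsed milliseconds.
    public long stop(String caption) {
        long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
        System.out.println(caption + ": " + elapsedMs + " ms");
        return elapsedMs;
    }
}
```

Calling `start()` before the subjob and `stop("subjob_1")` after it prints a line such as `subjob_1: 52 ms` to the console.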
tLogCatcher

• Fetches a set of fields and messages from a Java exception, tDie or tWarn and passes them on to the next component.
• Operates as a log function triggered by a Java exception, tDie or tWarn, to collect and transfer log data.
tDie

• tDie kills the current job.
• tDie is used with a tLogCatcher to capture the log before killing the job.

• Die message: enter the message to be displayed before the job is killed.
• Error code: enter the error code if needed.
• Priority: set the level of priority. Priority can be one of trace, debug, info, warning, error, fatal.
Sample job:

Sample Output:
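The tDie / tLogCatcher pairing can be sketched in plain Java as an error carrying a message, code and priority that a catcher logs before the job stops. This is illustrative only; the class and method names are hypothetical, not Talend APIs:

```java
// Sketch of the tDie / tLogCatcher pairing: the "die" raises an error
// carrying a message, error code and priority; the catcher records it
// before execution stops. Names are illustrative, not Talend APIs.
public class DieDemo {
    public static class DieException extends RuntimeException {
        public final int code;
        public final String priority;
        public DieException(String message, int code, String priority) {
            super(message);
            this.code = code;
            this.priority = priority;
        }
    }

    // Runs the subjob and returns the log line a catcher would record.
    public static String runAndCatch(Runnable subjob) {
        try {
            subjob.run();
            return "job finished normally";
        } catch (DieException e) {
            // catcher analogue: collect priority, code and message
            return e.priority + "|" + e.code + "|" + e.getMessage();
        }
    }
}
```

A subjob that throws `new DieDemo.DieException("bad input file", 4, "fatal")` produces the log line `fatal|4|bad input file`, mirroring the Die message, Error code and Priority settings above.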
tWarn

• Provides a priority-rated message to the next component.
• tWarn triggers a warning, often caught by the tLogCatcher component for an exhaustive log.
• Note: cannot be used as a start component.

• Warn message: enter the message to be displayed.
• Code: enter the error code if needed.
• Priority: set the level of priority.
Sample job:

Sample Output:
tStatCatcher

• tStatCatcher gathers job processing metadata at the job level as well as at each component level.
• It operates as a log function triggered by the tStatCatcher Statistics check box of individual components, and collects and transfers this log data to the defined output.

Sample job:

NOTE: the tStatCatcher Statistics option is available in all components.
Output:
tLogRow

 Displays data or results in the Run console.


Performance Tuning Tips in Talend

1. Store on disk option (tMap: store temp data).
2. Reload at each row in tMap (when your lookup is huge and the main flow is small).
3. Re-use connections (e.g. use the tOracleConnection component).
4. Split a Talend job into smaller subjobs (tRunJob).
5. Parallelism (tParallelize).
6. Use database bulk components.
7. Never do a direct update to your database when you have many records; instead, perform the update internally in Talend and do a one-time insert.
8. Use queries for DML operations on the database.
9. Add metadata to your repository.
10. Use contexts (properties files).
11. Allocate more memory to the jobs.
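Tip 10 keeps environment-specific values (hosts, paths, credentials) in context/properties files rather than hard-coded in the job. At the Java level this is ordinary properties loading; a minimal sketch, with file contents and key names that are purely illustrative:

```java
import java.io.StringReader;
import java.util.Properties;

// Sketch of context-style configuration: environment-specific values
// live in a properties source and are read at run time, so the same
// job runs unchanged against dev and prod. Keys are illustrative.
public class ContextDemo {
    public static Properties load(String propertiesText) throws Exception {
        Properties ctx = new Properties();
        // In a real job this would be a FileReader on, e.g., a
        // per-environment properties file selected by a context name.
        ctx.load(new StringReader(propertiesText));
        return ctx;
    }
}
```

With a source such as `db_host=prod-db-01` and `db_port=3306`, the job reads `ctx.getProperty("db_host")` at run time instead of embedding the value, so switching environments means switching the file, not editing the job.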
Code Migration & Deployment -- Exporting Jobs
• Select job in Repository > Right Click > Export Items > Enable 'Export Dependencies' > Choose 'Select archive file' > Finish
Importing Jobs
• Repository > Select the section header Job Designs > Right Click > Import Items > Choose 'Select archive file' > Finish
Deploying Jobs
• To be able to trigger jobs from the command line, we need to deploy the jobs first.
• Once deployed, executable .jar files are generated, which we can trigger using the corresponding .bat or .sh files provided in the build content.
• Repository > Job Designs > Select Job > Right Click > Build Job > Enable 'Extract the zip file' > Finish
• The deployed folder contains a .bat and a .sh file. Depending on the OS of your machine, you can trigger the corresponding script.
Transformation Comparison between Leading ETL Tools

Comparison of Transformations
The following section presents a comparison of major operational differences between transformations of the same nature in Talend, Informatica PowerCenter and DataStage.
Transformation: Aggregation
• Talend: tAggregateRow, tAggregateSortedRow — can only segregate the data set
• Informatica PowerCenter: Aggregator — segregates and sorts the data set
• DataStage: Aggregator — segregates and sorts the data set

Transformation: Join
• Talend: tJoin — allows only inner and left outer join; allows a join between two data sources
• Informatica PowerCenter: Joiner — allows inner, left outer, right outer and full outer join; allows a join between two data sources
• DataStage: Join — allows inner, left outer, right outer and full outer join; allows a join between two data sources

Transformation: Lookup
• Talend: no separate transformation for lookup; this activity is performed by the multi-purpose tMap transformation
• Informatica PowerCenter: Lookup — allows reference to another data source (connected and unconnected)
Comparison of Transformations (continued)

Transformation: Pivot (Columns to Rows)
• Talend: tNormalize — captures and concatenates data into a single field and pivots; pivots columns to rows
• Informatica PowerCenter: Normalizer — pivots columns to rows

Transformation: Pivot (Rows to Columns)
• Talend: tDenormalize — captures and concatenates data into a single field and pivots; pivots rows to columns
• Informatica PowerCenter: none

Transformation: Filter
• Talend: tFilter — filters data based on a condition; allows capturing a separate flow for rejected records
• Informatica PowerCenter: Filter — filters data based on a condition; cannot capture rejected records

Transformation: Router
• Talend: no separate transformation; tMap can achieve this — filters data based on a condition; allows capturing a separate flow for rejected records
• Informatica PowerCenter: Router — filters and groups data based on conditions; allows capturing rejected records (default)
Comparison of Transformations (continued)

Transformation: Multiple transformations (Expression, Lookup, Joiner, Filter, Router)
• Talend: tMap — this one transformation in Talend can perform the tasks of multiple transformations in Informatica / DataStage
• Informatica PowerCenter: Expression, Joiner, Lookup, Filter, Router

Operation-level differences between tMap and the corresponding Informatica transformations:
• Expression: tMap allows data transformation one row at a time; the Informatica Expression transformation likewise allows data transformation one row at a time.
• Lookup: tMap supports connected flow only (no unconnected lookup, no dynamic lookup, no persistent cache) but allows un-cached lookup; the Informatica Lookup supports both connected and unconnected flows, dynamic lookup and persistent cache, but no un-cached lookup.
• Join: tMap allows joining multiple data sources at the same time (the tJoin transformation allows a join between two data sources), with immediate data visualization possible for inner-join failure records; the Informatica Joiner allows a join between only two data sources, with no immediate data visualization.
• Filter, Router: tMap also covers these operations.
THANK YOU

 Guided by:
Subramanyam Korlakunta
