Subramanyam K
Agenda
Talend is a company focused on Data Integration and Data Management solutions. Talend was named a Gartner "Cool Vendor" in 2010. It is a fast-growing company present in more than 12 locations around the world.
Talend Open Studio for Data Integration is an open source data integration product developed by Talend.
Data Integration involves combining data residing in different sources and providing the user with a
unified view of the data.
Introduction to ETL
Extraction Transformation Loading (ETL)
Data integration refers to the technical and business processes used to combine data from multiple
sources to provide a unified, single view of the data.
Talend Offerings
Talend Offerings / Talend Unified Platform
Talend Unified Platform
Talend Architecture
Talend Data Integration - Architecture
Functional Architecture
How it works?
■ Talend is a code-generating ETL tool that uses Java as the underlying technology: the generated Java code performs the data Extraction, Transformation and Loading.
■ For database connectivity: Talend uses ODBC and JDBC drivers delivered and certified by the database vendors themselves (Oracle, IBM DB2, Teradata, SQL Server, MySQL, etc.).
■ Talend implements the bulk loaders of those database vendors through their APIs or executables.
■ For file connectivity: depending on the format, Talend leverages different libraries underneath. Delimited files are straightforward, while XML or JSON files use the appropriate library (XPath queries, etc.).
■ For CRM, ERP and other business applications: Talend integrates through the Web Services APIs provided by the business application vendors.
■ For SAP: Talend reuses the JCo connector/library, which is certified and provided by SAP itself, to call RFC and BAPI functions; Talend also provides an IDoc connector for SAP, etc.
■ Talend also provides connector and protocol support for FTP (FTPS or SFTP), SCP, SOAP, REST, web services, RSS, LDAP, etc.
TALEND STUDIO
Talend System Requirement
The following are the system requirements to download and work on Talend Open Studio −
DB connection(Relational)
JDBC schema
SAS connection,
file schema
FTP connection
LDAP schema
salesforce schema
Generic schema
MDM connection
WSDL schema
Initial Setup – Create File MetaData
tFileInputDelimited properties
• File Name/Stream - Name and path of the file to be processed.
o tLogRow properties
• Basic - Displays the output flow in basic mode.
• Table - Displays the output flow in table cells
• Separator - Enter the separator which will delimit data on the Log display.
Creating a first Job
tMap transforms and routes data from single or multiple sources to single or
multiple destinations.
Talend Job Creation continue..
Talend Job Creation completed.
Create First Job
Step 1: In the Job Designs view, right-click and select "Create job".
Step 2: Drag a connection (as applicable) from the Metadata palette, and choose Input/Output as applicable.
Step 3: Right-click the source (Main) and drag to the target.
Step 4: Update the source and target locations as applicable.
Click Run
COMPONENTS
PALETTE VIEW
What is a component?
Over 40 RDBMSs are supported, along with other types of database connectors, such as connectors for appliance/DW databases and database management. There are about 346 components under the Database component category.
Database Component (Contd..)
Databases - Traditional components
tAccessBulkExec, tAccessClose, tAccessCommit, tAccessConnection, tAccessInput, tAccessOutput, tAccessOutputBulk, tDB2Commit, tDB2Connection, tDB2Input, tDB2Output, tDB2Rollback, tDB2Row, tDB2SCD, tInformixSP, tMemSQLClose, tMemSQLConnection, tMemSQLInput, tMemSQLOutput, tMemSQLRow, tMSSqlBulkExec, tMSSqlTableList, tMysqlBulkExec, tMysqlClose, tMysqlColumnList, tMysqlCommit, tMysqlConnection, tMysqlInput, tOleDbRow, tOracleBulkExec, tOracleClose, tOracleCommit, tOracleConnection, tOracleInput, tOracleOutput, tPostgresqlOutput, tPostgresqlOutputBulk, tPostgresqlOutputBulkExec, tPostgresqlRollback, tPostgresqlRow, tPostgresqlSCD, tPostgresqlSCDELT
This mode supports all of the most popular databases, including Teradata, Oracle, Vertica, Netezza, Sybase, etc.
In ELT, data is migrated in bulk and the transformation process occurs after the data has been loaded into the target DBMS in its raw format.
Note, however, that because SQL is less powerful than Java, the scope of available data transformations is limited; ELT requires users with high proficiency in SQL tuning and DBMS tuning.
ELT (Contd..)
Types of File Components
INPUT: tFileInputDelimited, tFileInputExcel, tFileInputXML, tFileInputProperties, tFileInputRegex
OUTPUT: tFileOutputDelimited, tFileOutputExcel, tFileOutputXML, tFileOutputProperties
File Input Components
tFileInputDelimited: Reads a given file row by row with simple separated fields.
tFileInputExcel: Reads an Excel file (.xls or .xlsx) and extracts data line by line
tFileInputXML: Reads an XML structured file and extracts data row by row.
tFileInputMSXML: Reads and outputs multiple schema within an XML structured file.
tFileInputDelimited
tFileInputDelimited reads a given file row by row, splits each row into fields using a simple separator, and then sends the fields, as defined in the schema, to the next Job component via a row link.
tFileInputDelimited properties
• File Name/Stream - Name and path of the file to be processed.
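The behavior described above can be sketched in plain Java. This is an illustrative sketch only, under the assumption that each row is split on a field separator; the class and method names are hypothetical and are not Talend's actual generated code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: class and method names are hypothetical,
// not Talend's actual generated code.
public class DelimitedReader {

    // Split each row on the field separator; the -1 limit keeps trailing
    // empty fields, so every column of the schema is preserved.
    public static List<String[]> readRows(String content, String fieldSeparator) {
        List<String[]> rows = new ArrayList<>();
        for (String line : content.split("\n")) {
            if (!line.isEmpty()) {
                rows.add(line.split(fieldSeparator, -1));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two rows, each holding an id and a name field separated by ";"
        for (String[] row : readRows("1;alice\n2;bob", ";")) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```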
tFileOutputMSDelimited: Writes a file row by row based on the multiple schema and
pattern in a delimited file.
tFileOutputExcel: Writes the cells in MS Excel file row by row with separated data value
according to a defined schema.
tFileOutputXML: Writes an XML file with separated data values according to an XML tree
structure. XML structure is created from rows broken into fields.
tFileOutputDelimited
• tFileOutputDelimited writes a delimited file that holds data organized according to the defined schema; in other words, it outputs data to a delimited file.
• tFileOutputDelimited properties
• File name - File name with path
• Row Separator - “\n”
• Field Separator - “,”
• Append - check box.
• Include Header - check box.
• Compress as zip file - check box.
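Conceptually, the output side mirrors the reader: fields are joined with the field separator, rows with the row separator, with an optional header line. The following is a hedged plain-Java sketch; all names are illustrative, not Talend's generated code.

```java
import java.util.List;

// Illustrative sketch only: names are hypothetical, not Talend's generated code.
public class DelimitedWriter {

    // Join fields with the field separator and rows with the row separator,
    // optionally prefixing a header line built from the schema's column names.
    public static String write(String[] header, List<String[]> rows,
                               String fieldSep, String rowSep, boolean includeHeader) {
        StringBuilder sb = new StringBuilder();
        if (includeHeader) {
            sb.append(String.join(fieldSep, header)).append(rowSep);
        }
        for (String[] row : rows) {
            sb.append(String.join(fieldSep, row)).append(rowSep);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String out = write(new String[] {"id", "name"},
                List.of(new String[] {"1", "alice"}, new String[] {"2", "bob"}),
                ",", "\n", true);
        System.out.print(out);
    }
}
```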
File management Components
There are various components that help implement operations (delete, compare, copy, archive, etc.) on day-to-day files.
tFileArchive: This component creates a new zip, gzip, or tar.gz archive file from one or more specified files or folders, using different compression methods.
tFileList: Retrieves a set of files or folders based on a filemask pattern and iterates on each set of
files or folders.
tFileCompare: Compares two files and provides comparison data. Helps at controlling the data
quality of files being processed.
tFileCopy: This component copies a source file or folder into a target directory.
tFileDelete
tFileCopy
tFileList
tFileRowCount
tFileUnarchive
tFileExist tFileCompare
Introduction to various key components
tMap: Allows joins, filtering, transformations and routes data from single or multiple sources to single or multiple targets.
Function: tMap is an advanced component, which integrates itself as plugin to Talend Studio.
Purpose: tMap transforms and routes data from single or multiple sources to single or multiple destinations.
Usage: Possible uses are from a simple reorganization of fields to the most complex Jobs of data multiplexing or demultiplexing
transformation, concatenation, inversion, filtering and more...
tJoin: Performs inner or left outer join between the main data flow and a lookup flow.
Function: tJoin joins two tables by doing an exact match on several columns. It compares columns from the
main flow with reference columns from the lookup flow and outputs the main flow data and/or the rejected data.
Purpose: This component helps you ensure the data quality of any source data against a reference data source.
Usage: This component is not startable and it requires two input components and one or more output component.
tSortRow: tSortRow sorts input data based on one or several columns, by sort type and order. It allows multi-column sorting on input data, providing advanced sorting capabilities (asc/desc sorting, alphabetical sorting). The tSortRow component belongs to the Processing family.
tFilterRow: Filters input rows by setting conditions on the selected columns.
Function : tFilterRow filters input rows by setting one or more conditions on the selected columns.
Purpose: tFilterRow helps parametrizing filters on the source data.
Usage: This component is not startable (green background) and it requires an output component.
Introduction to various key components
tConvertType: Allows specific conversions at run time from one Talend java type to another
type.
tAggregateRow: Receives a flow and aggregates it based on one or more columns. For each output line, the aggregation key and the relevant results of the set operations (min, max, sum, avg, count, etc.) are provided.
tUniqRow: Makes a data flow unique based on the key set on the schema.
tRunJob: tRunJob belongs to two component families: System and Orchestration.
Function: tRunJob executes the Job called in the component's properties, within the frame of the defined context.
Purpose: tRunJob helps master complex Job systems which need to execute one Job after another.
Usage: This component can be used as a standalone Job, or can help clarify a complex Job by avoiding having too many subjobs together in one Job.
If you want to create a reusable group of components to be inserted in several Jobs or several times in the same Job, you can use a Joblet.
Unlike the tRunJob, the Joblet uses the context variables of the Job in which it is inserted. For more information on Joblets, see Talend Studio User Guide.
Note : This component also allows you to call a Job of a different framework, such as a Spark Batch Job or a Spark Streaming Job.
Connection Types
There are various types of connections which define either the data to be processed, the data output, or the Job logical sequence
Row
A Row connection handles the actual data. The Row connections can be main, lookup, reject or output according to the nature of the flow
processed.
- Main
This type of row connection is the most commonly used connection. It passes on data flows from one component to the other, iterating on each
row and reading input data according to the component properties setting (schema).
- Lookup
This row connection connects a secondary flow (the lookup flow) to the main flow, in components that accept multiple input flows.
- Output
Trigger
Trigger connections define the processing sequence, i.e. no data is handled through these connections.
Triggers
• Sub Job Triggers
• Run If Triggers
• On Component Triggers
Iterator
Files & XMLs
Delimited file
Follow the steps below to import a delimited file.
Importing a Delimited File
Follow the steps below to import a delimited file.
Importing a Regex File
Follow the steps below to import a regex file.
Importing an Excel File
Follow the steps below to import Excel file metadata.
tFileTouch
Either creates an empty file or, if the specified file already exists, updates its
date of modification and of last access while keeping the contents unchanged.
tWaitForFile
Iterates on the specified directory and triggers the next component when the
defined condition is met.
This component is used to put the component connected with it in waiting
state. It then triggers that component when the defined file operation occurs in
the specified directory.
tFileExist
Output
tFileInputRaw
Reads full rows of a delimited or fixed-width file.
It does not allow the user to create their own schema.
Output
Creating an Input XML Definition
Follow below steps to import XML file.
Creating an Output XML Definition
tExtractXMLField
Reads an input XML field of a file or a database table and extracts desired
data.
Data Processing Components
Introduction to various key components
• tMap: Allows joins, filtering, transformations and routes data from single or
multiple sources to single or multiple targets.
• tJoin: Performs inner or left outer join between the main data flow and a
lookup flow.
• tSortRow: Allows multi-column sorting on input data, providing advanced sorting capabilities (asc/desc sorting, alphabetical sorting).
• tFilterRow: Filters input rows by setting conditions on the selected columns.
• tConvertType: Allows specific conversions at run time from one Talend java
type to another type.
• tAggregateRow: Receives a flow and aggregates it based on one or more columns. For each output line, the aggregation key and the relevant results of the set operations (min, max, sum, etc.) are provided.
Introduction to various key components continue
• tUniqRow: Makes a data flow unique based on the key set on the schema.
• tUnite: Merges multiple inputs to single output.
• tReplicate: Creates multiple output sets from a single input set.
• tRowGenerator: Generates sample data.
• tDenormalize: Concatenates different fields into an array or a delimited string.
• tNormalize: Normalizes a flat row to multiple rows.
• tLogRow: Displays data or results in the Run console; it is used to monitor the data processed.
• tParallelize: The tParallelize component is an Orchestration component. The tParallelize
(available in Enterprise Edition) component is used for multiple executions of the jobs at
the same time. This can be achieved by multi-threading the jobs.
• tRun: tRun component allows you to embed one Talend Job within another so that it
may be executed as a Talend SubJob.
• tJava, tJavaRow, tJavaFlex
tReplicate
Duplicates the incoming flow into two identical output flows, which allows you to perform different operations on the same schema.
Usage
This component is not startable (green background); it requires an input component and an output component.
tSplitRow
tSplitRow splits one row into several rows. This component helps splitting one input row into several output rows.
Usage
This component splits one input row into multiple output rows by mapping input columns onto output columns.
tFlowToIterate
tFlowToIterate transforms a data flow into a list; it allows you to transform a processable flow into non-processable data.
Usage
You cannot use this component as a start component. tFlowToIterate requires an output component.
tIterateToFlow
tIterateToFlow transforms a list into a data flow that can be processed; it allows you to transform non-processable data into a processable flow.
Usage
This component is not startable (green background) and it requires an output component.
tFilterRow
• tFilterRow filters input rows by setting one or more conditions on the selected columns. It helps parametrize filters on the source data. It has a single input link and a single output link; optionally, it can also have a reject link that captures the unmatched data.
Sample Job:
• Output :
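The main/reject split described above can be sketched in plain Java. This is a hypothetical illustration (the names are invented, not Talend code): rows satisfying the condition go to the main output, the rest to the reject flow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of tFilterRow's behavior (names are illustrative):
// rows matching the condition go to the main output, the rest to the reject flow.
public class FilterRowSketch {

    // Returns [matched, rejected].
    public static List<List<String[]>> filter(List<String[]> input, Predicate<String[]> condition) {
        List<String[]> matched = new ArrayList<>();
        List<String[]> rejected = new ArrayList<>();
        for (String[] row : input) {
            (condition.test(row) ? matched : rejected).add(row);
        }
        return List.of(matched, rejected);
    }

    public static void main(String[] args) {
        List<String[]> input = List.of(
                new String[] {"1", "alice", "34"},
                new String[] {"2", "bob", "17"});
        // Condition on column 2: keep rows where age >= 18
        List<List<String[]>> out = filter(input, r -> Integer.parseInt(r[2]) >= 18);
        System.out.println("main=" + out.get(0).size() + " reject=" + out.get(1).size());
    }
}
```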
tFilterColumns
• tFilterColumns helps homogenize schemas, either by reordering columns, removing unwanted columns, or adding new columns.
• It makes the specified changes to the defined schema, based on column name mapping.
Output
Sample Job:
tSortRow
Function: Sorts input data based on one or several columns, by sort type and order.
Purpose: Helps create metrics and classification tables.
Usage: This component handles a flow of data, so it requires input and output components and is defined as an intermediary step.
Output:
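Multi-column sorting with per-column sort type and order amounts to a chained comparator. A plain-Java sketch, with hypothetical names and a hard-coded example sort order (column 0 ascending alphabetical, column 1 descending numeric):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of tSortRow: multi-column sorting with per-column
// sort type (alphabetical/numeric) and order (asc/desc).
public class SortRowSketch {

    // Sort rows by column 0 ascending (alphabetical), then column 1 descending (numeric).
    public static List<String[]> sort(List<String[]> input) {
        List<String[]> rows = new ArrayList<>(input);
        rows.sort(Comparator
                .comparing((String[] r) -> r[0])
                .thenComparing((String[] r) -> Integer.valueOf(r[1]), Comparator.reverseOrder()));
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> sorted = sort(List.of(
                new String[] {"b", "1"},
                new String[] {"a", "2"},
                new String[] {"a", "9"}));
        for (String[] r : sorted) System.out.println(r[0] + "," + r[1]);
    }
}
```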
tExternalSortRow
Function: Uses an external sort application to sort input data based on one or several columns, by sort type and order.
Purpose: Helps create metrics and classification tables.
Usage: This component handles a flow of data, so it requires input and output components and is defined as an intermediary step.
Output:
tAggregateRow
Output:
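The group-by-and-aggregate idea behind tAggregateRow can be sketched as follows. This is an illustrative plain-Java sketch with invented names, grouping on one key column and computing count, sum, min and max over a numeric value column:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of tAggregateRow: group rows by a key column and
// compute set operations (count, sum, min, max) per group.
public class AggregateRowSketch {

    // Returns key -> [count, sum, min, max] for a numeric value column.
    public static Map<String, long[]> aggregate(List<String[]> rows, int keyCol, int valueCol) {
        Map<String, long[]> result = new LinkedHashMap<>();
        for (String[] row : rows) {
            long v = Long.parseLong(row[valueCol]);
            long[] agg = result.computeIfAbsent(row[keyCol],
                    k -> new long[] {0, 0, Long.MAX_VALUE, Long.MIN_VALUE});
            agg[0]++;                      // count
            agg[1] += v;                   // sum
            agg[2] = Math.min(agg[2], v);  // min
            agg[3] = Math.max(agg[3], v);  // max
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, long[]> out = aggregate(List.of(
                new String[] {"FR", "10"},
                new String[] {"FR", "30"},
                new String[] {"US", "5"}), 0, 1);
        out.forEach((k, a) -> System.out.println(
                k + ": count=" + a[0] + " sum=" + a[1] + " min=" + a[2] + " max=" + a[3]));
    }
}
```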
tAggregateSortedRow
Function: tAggregateSortedRow aggregates the sorted input data for output columns based on a set of operations.
Purpose: Each output column is configured with as many rows as required, with the operations to be carried out and the input column from which the data will be taken, for better data aggregation.
It receives a sorted flow and aggregates it based on one or more columns. For each output line, the aggregation key and the relevant results of the set operations (min, max, sum, etc.) are provided.
Input rows count: Specify the number of rows that are sent to the tAggregateSortedRow component.
Note: If you specified a limit for the number of rows to be processed in the input component, you must use that same limit in the Input rows count field.
tAggregateSortedRow continue..
• tAggregateSortedRow helps provide a set of metrics based on values or calculations. As the input flow is meant to be sorted already, performance is greatly optimized.
• It receives a sorted flow and aggregates it based on one or more columns. For each output line, the aggregation key and the relevant results of the set operations (min, max, sum, etc.) are provided.
Output:
tNormalize
• Normalizes denormalized data.
• Function: Normalizes the input flow following the SQL standard.
• Purpose: tNormalize helps improve data quality and thus eases data updates.
• How do you split a multi-valued attribute column into individual rows using tNormalize?
Input
Output
Usage
This component can be used as
intermediate step in a data flow.
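The splitting of a multi-valued column into individual rows can be sketched in plain Java. All names here are hypothetical; the separator is treated as a regular expression, as with Java's String.split:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of tNormalize: a multi-valued column is split on an
// item separator, producing one output row per value.
public class NormalizeSketch {

    public static List<String[]> normalize(List<String[]> rows, int col, String itemSep) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            for (String value : row[col].split(itemSep)) {
                String[] copy = row.clone();   // keep the other columns unchanged
                copy[col] = value.trim();      // one row per item value
                out.add(copy);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One input row whose second column holds three values
        List<String[]> out = normalize(
                List.of(new String[] {"1", "red,green,blue"}), 1, ",");
        for (String[] r : out) System.out.println(r[0] + " -> " + r[1]);
    }
}
```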
tDenormalize
• Function: Denormalizes the input flow based on a key column.
• Purpose: tDenormalize helps synthesize the input flow.
• Usage: This component can be used as an intermediate step in a data flow.
• How do you combine multiple records into a single record using tDenormalize?
Input
Output:
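The reverse operation, combining rows that share a key into one delimited value, can be sketched as below. Names are illustrative; the "get distinct values" option of the real component is omitted here, so repeated values are simply concatenated.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of tDenormalize: rows sharing the same key (column 0)
// are combined into one row whose value column joins the values with a delimiter.
public class DenormalizeSketch {

    public static Map<String, String> denormalize(List<String[]> rows, String delimiter) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String[] row : rows) {
            // merge: first value is kept as-is, later values are appended
            out.merge(row[0], row[1], (a, b) -> a + delimiter + b);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> out = denormalize(List.of(
                new String[] {"1", "red"},
                new String[] {"1", "green"},
                new String[] {"2", "blue"}), ",");
        out.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```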
tDenormalizeSortedRow
Combines all input sorted rows in a group; distinct values of the denormalized sorted row are joined with item separators.
Function: tDenormalizeSortedRow combines all input sorted rows in a group. Distinct values of the denormalized sorted row are joined with item separators.
Purpose: tDenormalizeSortedRow helps synthesize a sorted input flow to save memory.
Input row count: Enter the number of input rows.
To denormalize: Enter the name of the column to denormalize.
Usage: This component handles flows of data, so it requires input and output components.
Output:
tExtractDelimitedFields
Function: tExtractDelimitedFields generates multiple columns from a given string column in a delimited file.
Purpose: tExtractDelimitedFields helps extract fields from within a string, for example in order to write them elsewhere.
Usage: This component handles a flow of data, so it requires input and output components. It allows you to extract data from a delimited field using a Row > Main link, and enables you to create a reject flow that filters out data whose type does not match the defined type.
Output:
tUniqRow – Data quality component
Function: Compares entries and sorts out duplicate entries from the input flow.
Purpose: Ensures data quality of input or output flow in a Job.
Usage: This component handles flow of data therefore it requires input and output, hence is defined
as an intermediary step.
Output:
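The Uniques/Duplicates split performed on a key can be sketched in plain Java. This is a hypothetical illustration, not Talend code: the first row seen for each key is unique, later rows with the same key are duplicates.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of tUniqRow: the first row seen for each key goes to
// the Uniques flow, subsequent rows with the same key go to the Duplicates flow.
public class UniqRowSketch {

    // Returns [uniques, duplicates].
    public static List<List<String[]>> split(List<String[]> rows, int keyCol) {
        Set<String> seen = new HashSet<>();
        List<String[]> uniques = new ArrayList<>();
        List<String[]> duplicates = new ArrayList<>();
        for (String[] row : rows) {
            // Set.add returns false when the key was already present
            (seen.add(row[keyCol]) ? uniques : duplicates).add(row);
        }
        return List.of(uniques, duplicates);
    }

    public static void main(String[] args) {
        List<List<String[]>> out = split(List.of(
                new String[] {"1", "alice"},
                new String[] {"1", "alice again"},
                new String[] {"2", "bob"}), 0);
        System.out.println("uniques=" + out.get(0).size() + " duplicates=" + out.get(1).size());
    }
}
```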
Merge component
tUnite
Output:
tConvertType
• tConvertType allows specific conversions at runtime from one Talend Java type to another.
• It helps automatically convert one Talend Java type to another, and thus avoids compile errors.
• Output:
tWaitForFile
Function: tWaitForFile iterates on the specified directory and triggers the next component when the defined condition is met.
Purpose: This component is used to put the component connected with it in waiting state. It then triggers that component
when the defined file operation occurs in the specified directory.
Usage : This component plays the role of triggering the next component based on the defined condition. Therefore this
component requires another component to be connected to it via a link.
tReplace
Searches for a character, pattern, or word and replaces it with the given pattern.
Function: Carries out a Search & Replace operation in the defined input columns.
Purpose: Helps cleanse all files before further processing.
Usage: This component is not startable as it requires an input flow, and it requires an output component.
Input
Output:
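A search-and-replace pass over a column, either literal or regex-based, can be sketched with plain Java string handling. The names below are hypothetical, not Talend's component code.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of tReplace: a search-and-replace pass over a column
// value, either as a literal substring or as a regular expression.
public class ReplaceSketch {

    public static String replace(String value, String search, String replacement, boolean regex) {
        return regex
                ? Pattern.compile(search).matcher(value).replaceAll(replacement)
                : value.replace(search, replacement);
    }

    public static void main(String[] args) {
        // Literal replacement
        System.out.println(replace("Mr Smith", "Mr", "Mister", false));
        // Regex: collapse repeated spaces, a common cleansing step
        System.out.println(replace("a   b  c", " +", " ", true));
    }
}
```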
tReplaceList -- Data Quality component
Searches for a character, pattern, or word and replaces it with the given pattern, based on an external lookup.
Function: Carries out a Search and Replace operation in the defined input columns based on an external lookup.
Purpose: Helps cleanse all files before further processing.
Usage: tReplaceList is an intermediary component. It requires an input flow and an output component.
tSchemaComplianceCheck
Function: Validates all input rows against a reference schema, or checks types, nullability, and length of rows against reference values. The validation can be carried out in full or partly.
Purpose: Helps to ensure the data quality of any source data against a reference data source.
Usage: This component is an intermediary step in the flow allowing to exclude from the main flow the non-compliant data.
This component cannot be a start component as it requires an input flow. It also requires at least one output component to
gather the validated flow, and possibly a second output component for rejected data using Rejects link.
tSampleRow
Output:
tJoin
Joins two tables by doing an exact match on several columns. It compares
columns from the main flow with reference columns from the lookup flow and
outputs the main flow data and/or the rejected data.
Left Join
Note: Include lookup columns in output: this option should be selected in the properties in order to fetch fields from the reference link.
Note: To perform a right join, change the main link to reference and the reference link to main.
Input Reference
• Output
Inner Join
To perform an inner join using tJoin, select the Inner join option in the tJoin properties.
1. Main: inner-joined records.
2. Inner join reject: rejected records from the main link (source data).
Reference
Inner join
1. Load once: loads all the records from the lookup flow once (and only once), either in memory or, if the Store temp data option is set to true, in a local file, before processing each record of the main flow.
2. Reload at each row: loads all the records of the lookup flow for each record of the main flow. Use this when the lookup data flow is constantly updated and you want to join against the latest lookup data for each record of the main flow.
3. Reload at each row (cache): all the records of the lookup flow are loaded for each record of the main flow, but the lookup data is cached in memory; when a new loading occurs, only the records that are not already in the cache are loaded.
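The "load once" lookup join described above is essentially a hash join. A plain-Java sketch with hypothetical names: the lookup flow is loaded into a map once, then each main-flow row is matched on its key, with unmatched rows going to the inner-join reject flow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an inner join as tJoin performs it: the lookup flow
// is loaded once into a map, then each main-flow row is matched on its key.
public class JoinSketch {

    // Returns [joined, rejected]; column 0 is the join key in both flows.
    public static List<List<String[]>> innerJoin(List<String[]> main, List<String[]> lookup) {
        // "Load once": key -> lookup value (column 1 of the lookup flow)
        Map<String, String> ref = new HashMap<>();
        for (String[] row : lookup) ref.put(row[0], row[1]);

        List<String[]> joined = new ArrayList<>();
        List<String[]> rejected = new ArrayList<>();
        for (String[] row : main) {
            String match = ref.get(row[0]);
            if (match != null) {
                joined.add(new String[] {row[0], row[1], match});  // main + lookup columns
            } else {
                rejected.add(row);                                 // inner join reject
            }
        }
        return List.of(joined, rejected);
    }

    public static void main(String[] args) {
        List<String[]> main = List.of(new String[] {"1", "alice"}, new String[] {"3", "carol"});
        List<String[]> lookup = List.of(new String[] {"1", "Paris"}, new String[] {"2", "London"});
        List<List<String[]>> out = innerJoin(main, lookup);
        System.out.println("joined=" + out.get(0).size() + " rejected=" + out.get(1).size());
    }
}
```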
tMap Match Model [Contd…]
Performs the row matches for main flow and lookup flow in following ways :
1. All rows
2. Unique match
3. First match
4. All matches
tMap Join Model [Contd…]
1. Inner join
Store temp data = True: allows you to provide a temporary path for internal calculations.
Store temp data = False: uses the default cache memory for internal calculations.
Data Operation
FIX:
Rounds a number of type Double to a number of type Long with the precision specified in
the PRECISION statement.
Input: output:
Mathematical
SADD : Adds two string numbers and returns the result as a string number.
Input: Output :
Numeric
input : Output :
Relational
IsNull: Indicates whether a variable is null (if the incoming variable holds a null value it returns true, otherwise false).
NOT : Returns the complement of the logical value of an expression.
Input Output
String Handling
BTRIM: Deletes all blank spaces and tabs after the last non-blank character in an expression.
FTRIM: Deletes all blank spaces and tabs up to the first non-blank character in an expression.
TRIM: Deletes extra blank spaces and tabs from a character string.
Input
Output :
TalendDataGenerator
getFirstName(): Generates a random first name.
Output :
TalendDate
addDate: adds a number of days, months, etc. to a date (with the date given as a String with a pattern).
getDate: returns the current datetime in the given display format.
Input:
Output :
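The same two operations can be written with plain java.time; this is only an illustration of the idea (the actual TalendDate method signatures may differ, and the method names below are invented):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Sketch of what the TalendDate routines do, written with plain java.time.
// Method names are hypothetical; actual TalendDate signatures may differ.
public class DateRoutinesSketch {

    // Like addDate: parse a date given as a String with a pattern, add n days,
    // and return it formatted with the same pattern.
    public static String addDays(String date, String pattern, int days) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(pattern);
        return LocalDate.parse(date, fmt).plusDays(days).format(fmt);
    }

    // Like getDate: return the current date in the given display format.
    public static String today(String pattern) {
        return LocalDate.now().format(DateTimeFormatter.ofPattern(pattern));
    }

    public static void main(String[] args) {
        System.out.println(addDays("2024-01-30", "yyyy-MM-dd", 3));
        System.out.println(today("yyyy-MM-dd"));
    }
}
```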
TalendString
talendTrim: Returns a copy of the string with the leading and trailing specified character omitted.
TalendString[Contd..]
Output
Joins using tMap
Sample Job:
Inner Join using tMap
tMap settings:
Output
Left outer Join using tMap
Sample:
Output
Full Outer Join using tMap
Output
tMap
1. Join Model
2. Match Model
3. Lookup Model
Join Model
In tMap we can perform the joins below. The default Join Model is Left Outer Join:
• Left outer join
• Inner Join
Match Model
The default Match Model is the curiously named Unique match: if your primary row matches multiple rows in your lookup input, only the last matching row is output. The remaining options are First match, where only the first matching row is output, and All matches, where all matching rows are output.
tMap [contd…]
LookupModel
1. Load once: loads all the records from the lookup flow once (and only once), either in memory or, if the Store temp data option is set to true, in a local file, before processing each record of the main flow.
2. Reload at each row: loads all the records of the lookup flow for each record of the main flow. Use this when the lookup data flow is constantly updated and you want to join against the latest lookup data for each record of the main flow.
3. Reload at each row (cache): all the records of the lookup flow are loaded for each record of the main flow, but the lookup data is cached in memory; when a new loading occurs, only the records that are not already in the cache are loaded.
tMap [contd…]
• Each input allows you to specify an Expression and a Filter.
• Store temp data = True: allows you to provide a temporary path for internal calculations.
• Store temp data = False: uses the default cache memory for internal calculations.
Joining Data Source by using tMap
Joins using tMap
Joins using tMap Sample Job:
Output
Left outer Join using tMap
Sample:
Output
Full Outer Join using tMap
• Output
tMap Example
Reject_Output
tLoop
1. For Loop
2. While Loop
1. For Loop:
Output:
tLoop [Contd…]
2. While Loop:
Output:
tForEach
tForEach creates a loop on a list for an iterate link.
Output:
Useful Components in Talend
tParallelize helps you manage complex Job systems. It executes several subjobs simultaneously and synchronizes the
execution of a subjob with other sub-jobs within the main Job.
tParallelize: The tParallelize component is an Orchestration component. The tParallelize (available in Enterprise
Edition) component is used for multiple executions of the jobs at the same time. This can be achieved by multi-threading
the jobs.
tRun: This component allows you to embed one Talend Job within another so that it may be executed as a Talend subjob. It executes the Job called in the component's properties, within the frame of the defined context.
tRun helps master complex Job systems which need to execute one Job after another; by this means it can be used to create a parent-child relationship among Jobs.
You can take the tRunJob component from the Component Palette (Orchestration), or you can drag an existing Job from Job Designs in the Repository browser.
Usage
This component can be used as a standalone Job, or can help clarify a complex Job by avoiding having too many subjobs together in one Job.
Example job to describe tParallelize and tRun
Running jobs in parallel mode
• In order to run Jobs in parallel mode, select the Multi thread execution option in the Job view.
• Dynamic schemas allow you to design Jobs with an unknown column structure (unknown name and number of columns).
• For example, if you need to migrate a whole database with hundreds of tables, you can do so with a single Job, without explicitly including the table structures.
Dynamic schema supported components
tFileInputDelimited tParAccelInput tAggregateRow
tFileOutputDelimited tSQLiteInput tSortRow
tAccessInput tSasInput tFilterRow
tAS400Input tSybaseInput tWriteDynamicFields
tDBInput tVectorWiseInput tExtractDynamicFields
tDB2Input tVerticaInput tUnite
tEXAInput tVerticaOutput tUniqRow
tFirebirdInput tTeradataInput tRunJob
tGreenPlumInput tJava tReplicate
tHSQLDBInput tJavaFlex tAggregateSortedRow
tIngresInput tJavaRow tFilterColumns
tInformixInput tLogRow tJoin
tJavaDBInput tMap tSampleRow
tJDBCInput tOracleOutput tHashInput
tMaxDBInput tMysqlOutput tHashOutput
tMysqlInput tMSSqlOutput tFileInputPositional
tMSSqlInput tPostgresqlOutput tFileOutputPositional
tNetezzaInput tAS400Output tAmazonMysqlInput
tOracleInput tDB2Output tAmazonMysqlOutput
tPostgresqlInput tInformixOutput tAmazonOracleInput
tPostgresPlusInput tSybaseOutput tAmazonOracleOutput
tTeradataOutput tSAPHanaInput
tSAPHanaOutput
tLDAPInput
Custom Code Components
tJava
tJavaRow
tJavaflex
tLibraryLoad
tSetGlobalVar
tSetDynamicSchema (Enterprise)
tJava
tJava has no input or output data flow and is used as a separate subjob.
Output
tJavaRow
The tJavaRow component allows Java logic to be performed for every record within a flow.
Since tJavaRow is called for every row processed, it is possible to create a global variable for a row that can be referenced by all components in a flow.
The tJavaRow code applies exclusively to the main part of the generated code of the subjob.
The Java code inserted through the tJavaRow will be executed for each row. Generally,
the tJavaRow component is used as an intermediate component and you are able to access the
input flow and transform the data.
The following use case shows a typical Job using a tJavaRow:
A tFileInputDelimited component reads data from a text file,
then the transformed data is displayed to the console using a tLogRow component.
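The per-row logic of that use case amounts to a small transform function applied between the input and the output. A plain-Java sketch (the class and method names are invented; in real tJavaRow code you would assign output_row fields from input_row fields instead):

```java
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java sketch of what a tJavaRow transform amounts to: a snippet of
// Java applied to every row between an input and an output component.
public class JavaRowSketch {

    // The per-row logic: here, upper-case the name field (column 1).
    public static String[] transform(String[] inputRow) {
        return new String[] {inputRow[0], inputRow[1].toUpperCase()};
    }

    public static void main(String[] args) {
        List<String[]> input = List.of(new String[] {"1", "alice"}, new String[] {"2", "bob"});
        // Equivalent of tFileInputDelimited -> tJavaRow -> tLogRow
        List<String[]> output = input.stream()
                .map(JavaRowSketch::transform)
                .collect(Collectors.toList());
        for (String[] row : output) System.out.println(row[0] + "|" + row[1]);
    }
}
```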
tJavaFlex
The start part will be executed at the beginning of the subjob, but only once.
The main part allows you to access the input flow and modify the data.
The end part will be executed at the end of the subjob, but only once.
tJavaFlex - Sample Job
Output:
Difference between tJava, tJavaRow and tJavaFlex
External JARs -- tLibraryLoad
tLibraryLoad allows you to load useable Java libraries in a Job.
Otherwise, go to Window > Preferences.
CONTEXTS & VARIABLES
Types of Context
• Context
• Global Context
The Global Map
• put
• get
Context Parameters & Variables
Context variables allow Jobs to be executed in different ways, with different parameters. Variables represent values which change throughout the execution of a program. A context is characterized by parameters.
You can define context variables for a particular Job in two ways:
The Contexts view is positioned among the configuration tabs below the design workspace.
If you cannot find the Contexts view on the tab system of Talend Studio, go to Window > Show view > Talend, and select Contexts.
The Contexts tab view shows all of the variables that have been defined for each component in the current Job and context variables
imported into the current Job.
Context Parameters (Contd..)
From this view, you can manage your built-in variables:
Create and manage built-in contexts.
Import variables from a Repository context source for use in the current Job.
Edit Repository-stored context variables and update the changes to the Repository.
2. Click in the Name field and enter the name of the variable you are creating, host in this example.
3. From the Type list, select the type of the variable corresponding to the component field where it will be used, String for the variable host in this example.
4. For different variable types, the Value field appears slightly different when you click in it, and it functions differently.
Defining Contexts from Contexts View
Contexts View > Click ‘+’ available on right end > In the ‘Configure
Contexts’ popup, click on ‘New’ > enter the desired name
Defining Variables from Contexts View
Contexts View > Click the ‘+’ at the bottom left > Fill in the Name, Datatype and Value fields for all the contexts as required.
Using Variables in a Job
In the relevant Component view, place your cursor in the field you want to
parameterize and press Ctrl+Space to display a full list of all the global
variables and the context variables defined in or applied to your Job.
Defining Context Variables from Component View
On the relevant Component view, place your cursor in the field you want to
parameterize > Press F5 to display the [New Context Parameter] dialog box > Fill
the details as required
Centralizing Context Variables in the Repository
Right-click the Contexts node in the Repository tree view and select
Create context group from the contextual menu
Adding a built-in Context Variable to the Repository
In the Context tab view of a Job, right-click the context variable you want to
add to the Repository and select Add to repository context from the contextual
menu to open the [Repository Content] dialog box.
Applying Repository Context Variables to a Job
Double-click the Job to which a context group is to be added. Once the Job is
opened, drop the context group of your choice either onto the Job workspace
or onto the Contexts view beneath the workspace.
Running Job in a Selected Context
Click the Run Job tab, and in the Context area, select the relevant context among
the various ones you created
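Exported Jobs can also receive the context and parameter overrides on the command line, e.g. `--context=PROD --context_param host=db.example.com` (the flag names follow Talend's launcher convention; the parsing below is an illustrative sketch, not the generated launcher code):

```java
import java.util.HashMap;
import java.util.Map;

public class ContextArgsSketch {
    // Parse --context=NAME and --context_param key=value arguments.
    public static Map<String, String> parse(String[] args) {
        Map<String, String> params = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("--context=")) {
                params.put("context", args[i].substring("--context=".length()));
            } else if (args[i].equals("--context_param") && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2);
                params.put(kv[0], kv[1]);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> p = parse(new String[] {
                "--context=PROD", "--context_param", "host=db.example.com" });
        System.out.println(p.get("context") + " / " + p.get("host"));
        // prints "PROD / db.example.com"
    }
}
```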
Different Ways to achieve context
Via Talend Context
Output:
NESTED JOBS
Running jobs within jobs, passing values from parent to child jobs, etc.
Introduction to Nested Jobs
■ The tLogRow components placed in each Job help us preview the data
transformation flow.
Logs & Errors components
The Logs & Errors family groups the components dedicated to capturing log
information and handling Job errors.
List & most used components in Logs & Errors
tLogCatcher
tDie
tWarn
tStatCatcher
tLogRow
tChronometerStart
tChronometerStop
Sample Output:
tLogCatcher
Fetches a set of fields and messages from Java exceptions, tDie or tWarn
and passes them on to the next component.
• Die message: Enter the message to be displayed before the Job is killed.
• Error code: Enter the error code if needed.
• Priority: Set the level of priority.
• Priority can be one of Trace, Debug, Info, Warning, Error or Fatal.
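The six priority levels form an ordered severity scale, which can be sketched as a Java enum (the ordering, Trace lowest to Fatal highest, matches the list above; the threshold logic is an illustrative assumption, not Talend's internal code):

```java
public class PrioritySketch {
    // The six priority levels available on tDie/tWarn, ordered by severity.
    enum Priority { TRACE, DEBUG, INFO, WARNING, ERROR, FATAL }

    public static void main(String[] args) {
        Priority caught = Priority.ERROR;
        Priority threshold = Priority.WARNING;
        // Only react to messages at or above the chosen threshold.
        if (caught.compareTo(threshold) >= 0) {
            System.out.println("caught: " + caught); // prints "caught: ERROR"
        }
    }
}
```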
Sample job:
Sample Output:
tWarn
Sample Output:
tStatCatcher
Sample job:
Developers can change the Job version in the Studio using the minor and
major buttons (for example, from version 0.1 to 0.2, or to 1.0).
By default, the Job version is 0.1.
Talend allows you to maintain any number of versions.
When creating a new Job "Test_Job", its default version will be 0.1.
1. Right-click the Job and select Open another version.
2. Select the old version to open it.
Talend - Introduction
Talend is the leading open source integration software provider to data-driven enterprises.
Data migration
Big Data.
History
Talend was founded in 2005 by Bertrand Diard and Fabrice Bonan. It is
headquartered in Redwood City, California.
The company's first product, Talend Open Studio for Data Integration, was
launched in October 2006.
Currently Talend has more than 4,000 paying customers, including eBay, Sony
Online Entertainment and Disney.
More than 900 built-in connectors allow you to easily link a wide array of
sources and targets.
Talend - Architecture
The operating principles of the Talend products can be summarized briefly in
the following topics:
building technical or business-related processes,
administering users, projects, access rights, processes and their dependencies,
Each of the above topics can be isolated in different functional blocks and the different
types of blocks and their interoperability can be described as in the following architecture
diagram
Talend – Architecture (Contd..)
https://community.talend.com/
Prerequisites for TOS for DI
Operating System: MS Windows, Ubuntu Linux or Red Hat Linux
Disk Usage: 3 GB of disk space required for installation, 3+ GB of disk space
required for use
https://www.talend.com/download/talend-open-studio/#t4
Software
1. Download the TOS for DI V7.1.1 archive.
2. Unzip the downloaded file and, in the extracted folder, click the executable file corresponding to your operating system.
Step-1
Step-2
Step-3
Talend Data Integration Studio
3. Configuration Tabs (Properties): configuration of components, Jobs and
context variables can be done here.
Output:
Comparison of Transformations
The following section compares the major operational differences between transformations of the same nature in Talend, Informatica PowerCenter and DataStage.
Segregation / sorting:
• Talend: can only segregate the data set.
• Informatica PowerCenter: segregates and sorts the data set.
• DataStage: segregates and sorts the data set.
Join:
• Talend: allows only inner and left outer joins; allows a join between two data sources.
• Informatica PowerCenter: allows inner, left outer, right outer and full outer joins; allows a join between two data sources.
• DataStage: allows inner, left outer, right outer and full outer joins; allows a join between two data sources.
Lookup:
• Talend: no separate transformation for lookup; this activity is performed by the multi-purpose tMap transformation.
• Informatica PowerCenter: the Lookup transformation.
Guided by:
Subramanyam Korlakunta