
Introduction to ETL and Talend Open Studio

Jenny Elizabeth Abella Sánchez


Computer Science Engineer
MBA - BI & Big Data
Class 1
Introduction to the ETL dataflow

• Extraction

• Transformation

• Load
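
The three stages above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class `EtlSketch`, its method names, the `name;amount` input format and the in-memory target list are all hypothetical and are not part of Talend, which generates its own Java code from the visual Job design.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the three ETL stages; none of these names come from Talend.
class EtlSketch {

    // Extraction: split raw delimited lines ("name;amount") into fields.
    static List<String[]> extract(List<String> rawLines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : rawLines) {
            rows.add(line.split(";"));
        }
        return rows;
    }

    // Transformation: trim and upper-case the name, drop rows whose amount is not positive.
    static List<String[]> transform(List<String[]> rows) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            double amount = Double.parseDouble(row[1].trim());
            if (amount > 0) {
                out.add(new String[] { row[0].trim().toUpperCase(Locale.ROOT), String.valueOf(amount) });
            }
        }
        return out;
    }

    // Load: append the transformed rows to the target (an in-memory list standing in for a table).
    static int load(List<String[]> rows, List<String> target) {
        for (String[] row : rows) {
            target.add(row[0] + "," + row[1]);
        }
        return rows.size();
    }

    public static void main(String[] args) {
        List<String> target = new ArrayList<>();
        List<String> source = Arrays.asList(" ana ;10.5", "luis;-3", "eva;7");
        int loaded = load(transform(extract(source)), target);
        // Prints: 2 rows loaded: [ANA,10.5, EVA,7.0]
        System.out.println(loaded + " rows loaded: " + target);
    }
}
```

Each stage only depends on the output of the previous one, which is the property that lets an ETL tool represent them as connected components.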
Introduction to the ETL dataflow

ELT

• Extraction

• Load

• Transformation

• Less network stress

• Higher performance
Standards and Transformations

Jobs design patterns

• Create sequences from top to bottom, leaving room on the right for data
outputs and for connections to error-handling or message components.

• A Job with too many components can be difficult to understand and maintain.

• It is preferable to create jobs that control the sequence of other jobs using the
tRunJob component.
Standards and Transformations

Jobs design patterns

• Document each job with a label or comment that states in general terms
what it does and the objective of its transformations, ideally including its
value in business terms.

• Keep job names as short as possible while still indicating what the job
mainly does, either with a single representative word or with words from the
process it belongs to, following a documented naming convention for jobs.
Standards and Transformations
Standard transformation
Scalability and Performance
• Configuration file: TOS_DI-win32-x86.ini

• Parameters that define the memory of the JVM:

• -Xms: JAVA_MIN_MEM
• Memory assigned by default when a Job starts
• -Xmx: JAVA_MAX_MEM
• Maximum memory; raise it to avoid "Out of memory" exceptions
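
To check which limits a JVM actually received from the -Xms and -Xmx parameters above, the standard `Runtime` API can be queried. The sketch below is a minimal illustration; the class name `HeapInfo` and the helper `toMegabytes` are hypothetical, and the values printed depend entirely on how the JVM was started.

```java
// Minimal sketch: report the heap limits the running JVM actually received.
class HeapInfo {

    // Convert a byte count to whole megabytes.
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the -Xmx ceiling; totalMemory() is the heap currently
        // reserved, which starts near the -Xms value.
        System.out.println("Max heap (MB):     " + toMegabytes(rt.maxMemory()));
        System.out.println("Current heap (MB): " + toMegabytes(rt.totalMemory()));
    }
}
```

Running it with, for example, `java -Xms256m -Xmx1024m HeapInfo` (illustrative values) should report a maximum close to 1024 MB.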
Work Environment
Workshop: explore the environment

Objective:

• Become familiar with the environment, apply what has been learned and
prepare to understand the topics that follow in this session.
Component Types
• Business Intelligence: group of connectors that cover the needs of reading or
writing in multidimensional or OLAP databases, outputs to Jasper reports,
management of database changes for slowly changing dimensions,
etc. (all related to Business Intelligence).

• Business: connectors for reading from and writing to CRM systems (Centric,
Microsoft CRM, Salesforce, Sugar, Vtiger) or SAP systems. They also allow
working with the Alfresco document manager.

• Custom Code: components to define our own custom code and use it
integrated with the rest of the Talend components. We can write
components in Java and Perl, as well as load libraries or run custom Groovy
commands.
Component Types
• Data Quality: components for data quality management, such as filtering,
CRC* calculations, fuzzy-logic searches, replacement of values, validation of
schemas against metadata, removal of duplicates, etc.

• Databases: connection, input or output connectors for the most popular
databases (AS400, Access, DB2, Firebird, Greenplum, HSQLDB, Informix,
Ingres, Interbase, JavaDB, LDAP, MSSQL Server, MaxDB, MySQL, Netezza,
Oracle, ParAccel, PostgreSQL, SQLite, SAS, Sybase, Teradata, Vertica).

• ELT: components to work with databases in ELT mode (with the typical
transformations and processes of this type of system).

* CRC: cyclic redundancy check (CRC) is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to raw data.
https://en.wikipedia.org/wiki/Cyclic_redundancy_check
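
As an illustration of the CRC calculation mentioned in the footnote, the sketch below uses the standard `java.util.zip.CRC32` class to fingerprint a record and detect whether it changed. The class `CrcSketch` and its methods are hypothetical names chosen for this example; it does not show how any Talend component computes its checksum internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: fingerprint records with CRC-32 to spot accidental changes.
class CrcSketch {

    // Compute the CRC-32 checksum of a record's UTF-8 bytes.
    static long crc32Of(String record) {
        CRC32 crc = new CRC32();
        crc.update(record.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Two records with different checksums are certainly different.
    static boolean hasChanged(String previous, String current) {
        return crc32Of(previous) != crc32Of(current);
    }

    public static void main(String[] args) {
        // "123456789" is the conventional CRC test input; prints cbf43926
        System.out.println(Long.toHexString(crc32Of("123456789")));
        System.out.println(hasChanged("Ana;Bogota", "Ana;Medellin"));
    }
}
```

A checksum comparison of this kind is cheap, but equal checksums do not strictly prove equal data, so it is best treated as a change detector rather than a proof of identity.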
Component Types
• File: components for managing files (existence verification, copy,
deletion, listing, properties), for reading files of different formats (text, Excel,
delimited, XML, mail, etc.) and for writing to them.

• Internet: components to access content stored on the Internet, such as Web
services, RSS feeds, SCP, MOM, email, FTP servers and the like.

• Logs & Errors: components for error management and logging in the process
definition.
Component Types

• Miscellaneous: various components, such as message windows,
verification of server operation, record generator, context variable
management, etc.

• Orchestration: components to define the sequences and tasks
of orchestration and processing of jobs and subjobs defined in our
transformations (loop generation, execution of previous or
subsequent jobs, waiting for files or data, etc.).
Component Types

• Processing: components for processing data flows, such
as aggregation, mapping, transformations, filtering,
denormalization, etc.

• System: components for interaction with the operating
system (command execution, environment variables, etc.).

• XML: components for working with XML data structures,
with parsing, validation or structure creation operations.
Data input and output components

• For an ETL tool it is essential to have a good collection of data
connections that allow you to access all types of systems.
Talend has a large number of components, and a
community that keeps adding new options.

• As for databases, the range goes from the most general ones,
such as MySQL, SQL Server, Oracle and PostgreSQL, to those
with more specific applications, such as Greenplum,
ParAccel or eXist.
Component Types
• Once we are able to connect the two systems, we have to
carry out the work itself. For this, we have numerous
components that allow us to manipulate the data as needed:
filters, type conversions, approximate searches, unions,
sorting and replacements. Components such as tJava or
tJavaRow also let us process the data with functions we
program ourselves.
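
The kind of row-level logic described above (a type conversion, a replacement and a filter) might look like the following in plain Java. This is a standalone sketch: `RowTransformSketch`, `OutRow` and the column names are hypothetical. Inside an actual tJavaRow component the code is typically written against the `input_row` and `output_row` variables that the Studio provides, rather than against a class like this one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone version of per-row logic one might place in a tJavaRow.
class RowTransformSketch {

    // Hypothetical output row; in a real Job the schema defines these columns.
    static class OutRow {
        final String customer;
        final int quantity;

        OutRow(String customer, int quantity) {
            this.customer = customer;
            this.quantity = quantity;
        }
    }

    // Type conversion + replacement + filter for one row; returns null when the row is rejected.
    static OutRow transformRow(String customer, String quantityText) {
        int quantity;
        try {
            quantity = Integer.parseInt(quantityText.trim()); // type conversion
        } catch (NumberFormatException e) {
            return null; // reject rows whose quantity is not a number
        }
        if (quantity <= 0) {
            return null; // filter out non-positive quantities
        }
        String cleaned = customer.trim().replace("N/A", "UNKNOWN"); // replacement
        return new OutRow(cleaned, quantity);
    }

    // Apply the row logic to every {customer, quantity} pair and keep the accepted rows.
    static List<OutRow> transformAll(String[][] rows) {
        List<OutRow> accepted = new ArrayList<>();
        for (String[] row : rows) {
            OutRow out = transformRow(row[0], row[1]);
            if (out != null) {
                accepted.add(out);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        String[][] rows = { { "Acme", "3" }, { "N/A", "x" }, { "Beta", "0" } };
        System.out.println(transformAll(rows).size() + " row(s) accepted");
    }
}
```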
Component Types

Orchestration Components

• These components help control the execution of Jobs,
such as tPreJob, tPostJob and tRunJob. They help to
organize the execution of data processing, to validate
prerequisites such as connections, and to handle errors and
the processes that run after the Jobs finish, whether they
were successful or not.
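
The control flow described above (check prerequisites, run child jobs in order, always run the closing step) resembles a try/catch/finally structure. The sketch below is an analogy in plain Java, not Talend's generated code: `OrchestrationSketch`, `run` and the log messages are hypothetical, with the `preJob`, `childJobs` and `postJob` parameters standing in for the roles of tPreJob, tRunJob and tPostJob.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical analogy for pre-job, child-job and post-job orchestration.
class OrchestrationSketch {

    // Runs the prerequisites, then each child job in order; the post step always runs.
    static List<String> run(Runnable preJob, List<Runnable> childJobs, Runnable postJob) {
        List<String> log = new ArrayList<>();
        try {
            preJob.run(); // e.g. validate connections
            log.add("pre: ok");
            int i = 1;
            for (Runnable job : childJobs) {
                job.run();
                log.add("child " + i + ": ok");
                i++;
            }
        } catch (RuntimeException e) {
            // A failing step stops the remaining child jobs.
            log.add("error: " + e.getMessage());
        } finally {
            postJob.run(); // cleanup runs whether the jobs succeeded or not
            log.add("post: done");
        }
        return log;
    }

    public static void main(String[] args) {
        Runnable ok = () -> { };
        Runnable failing = () -> { throw new IllegalStateException("no connection"); };
        // Prints: [pre: ok, child 1: ok, error: no connection, post: done]
        System.out.println(run(ok, Arrays.asList(ok, failing, ok), ok));
    }
}
```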
Component Types

Other components

• Beyond the data manipulation, input and output components,
we have many others that give the tool high-level
functionality. We can manipulate files in the file system,
access files via FTP and HTTP, send and receive emails, make
calls to the operating system, execute other Talend programs or
jobs, compress or decompress files, and apply cryptographic functions.
Once again the community adds new functions, for example a component
to access Google Analytics data through its API.
