Most of the time when talking about Talend jobs, people think of standard
ETL (Extract, Transform, Load). But in some cases there's a need to check
the incoming data before loading it into the target rather than just
transforming it. We refer to this process as E-DQ-L
(Extract, Data Quality, Load).
One of the things that you might want to check before loading is schema
compatibility. For example: you expect to get a String that's 5 characters long. If you, for
any reason, receive a String that is longer than 5 characters, it will generate an error. Or
perhaps you expect a percentage (as a BigDecimal like 0.19), but you
receive it as a string (19%). This will result in a failing job with
an error saying "Type mismatch: cannot convert
from dataType to otherDataType".
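To make the two mismatches above concrete, here is a minimal Java sketch of the checks you would otherwise leave to the schema. The class name, method names and the limit of 5 are illustrative assumptions taken from the example, not something Talend generates.

```java
import java.math.BigDecimal;

final class SchemaCheck {
    // Assumed limit from the example in the text: a String of at most 5 characters.
    static final int MAX_LENGTH = 5;

    // True when the value is present and fits the expected length.
    static boolean fitsLength(String value) {
        return value != null && value.length() <= MAX_LENGTH;
    }

    // Returns the value as a BigDecimal, or null when the text is not a plain decimal.
    static BigDecimal parseFraction(String value) {
        if (value == null) {
            return null;
        }
        try {
            return new BigDecimal(value.trim());
        } catch (NumberFormatException e) {
            return null; // e.g. "19%" cannot be read as a BigDecimal
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsLength("ABCDE"));   // fits
        System.out.println(fitsLength("ABCDEF"));  // too long
        System.out.println(parseFraction("0.19")); // parsed
        System.out.println(parseFraction("19%"));  // not a decimal
    }
}
```

A value that fails either check is exactly the kind of record you want to route away from the target instead of letting it break the job.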
Before I continue this blog I would like to emphasize that all the solutions
below are possible with the Data Integration version of Talend, except for the
last one. The last option requires a Talend Data Quality license.
Let's create an example case: we want to extract data on a regular basis
from a third-party source which we cannot fully trust in terms of schema settings. We know how many columns we can expect and we have a rough
idea of what they contain, but we do not fully trust the source to not give
incompatible data. We want to load the records that are valid and we want to
separately store the corrupt data for logging purposes. I've gathered
several solutions for this problem:
1. Use rejected flow on an input-component
One thing you can do is reject the records as soon as you import them.
Disable "die on error" on the basic settings tab of your input-component and
then right-click it and select Reject. The rows will be rejected based on the
schema of the file. In the example below we put phone number as an integer
and as you can see 1 record is being rejected. This is because the phone
number contains characters and therefore cannot be read as an integer. If
you did not disable the "die on error" option then this component would
make the job fail.
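Conceptually, the reject flow behaves like the sketch below: every value that cannot be read as the schema type goes to a separate list instead of stopping the run. This is plain Java written for illustration, with hypothetical names and sample values; it is not the code Talend generates for the component.

```java
import java.util.ArrayList;
import java.util.List;

final class RejectFlowSketch {
    final List<Integer> accepted = new ArrayList<>(); // main flow
    final List<String> rejected = new ArrayList<>();  // reject flow

    // Tries to read each raw phone number as an integer, like a schema column of type Integer.
    void read(List<String> rawPhoneNumbers) {
        for (String raw : rawPhoneNumbers) {
            try {
                accepted.add(Integer.parseInt(raw));
            } catch (NumberFormatException e) {
                // With "die on error" enabled, this is where the job would fail instead.
                rejected.add(raw);
            }
        }
    }

    public static void main(String[] args) {
        RejectFlowSketch flow = new RejectFlowSketch();
        flow.read(List.of("5550100", "55501a0", "5550199"));
        System.out.println("accepted: " + flow.accepted);
        System.out.println("rejected: " + flow.rejected);
    }
}
```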
3. Use a tFilter-component
You can make the data go through a filter-component before inserting it into
your target. You can (manually) decide what's allowed to go through. This
can be useful when your destination is not a database, in which case option
1 is most likely not available.
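The idea of a manual filter can be sketched as a single rule that splits the rows into a group that passes and a group that does not. The rule below (a phone number of 1 to 15 digits) and all names are assumptions made up for this example; in a real job you would express your own condition in the filter-component.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class FilterSketch {
    // Hypothetical rule: a valid phone number consists of 1 to 15 digits.
    static final Predicate<String> VALID_PHONE =
            phone -> phone != null && phone.matches("\\d{1,15}");

    // Key true holds the rows that pass the filter, key false the rows that are filtered out.
    static Map<Boolean, List<String>> split(List<String> rows) {
        return rows.stream().collect(Collectors.partitioningBy(VALID_PHONE));
    }

    public static void main(String[] args) {
        Map<Boolean, List<String>> parts = split(List.of("5550100", "call me", ""));
        System.out.println("to target: " + parts.get(true));
        System.out.println("filtered out: " + parts.get(false));
    }
}
```

Keeping the rule in one place makes it easy to store the filtered-out rows for logging, as described in the example case.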
The rejected rows and their error message look like this:
That's it for now. There are probably a lot of other ways of checking schema
compatibility. Feel free to comment if you know any. Thank you for reading!
Posted in Talend | Tagged ETL, Talend
In the first part of these entries we discussed how to test your expressions,
the importance of optimizing the appearance of a tLogRow component and
how to handle windows and views within Talend. This time around, we will be
talking about the different ways to get components into your job, how to
trace your dataflow and how to easily sync columns. As last time, this post
will be useful for both starting and experienced users.
4. Getting components into your job
There are many ways to get components into your job. Most people search
the palette (by either the search-function or by manually exploring the
folders) and drag/drop the components into their job. You can achieve the
same thing by simply clicking on a random place in your job and then typing
the name of the component. Obviously this is only recommended once
you're familiar with the different components and their names.
When working with metadata, you can use certain shortcuts to save a bit of
time. Usually people just click on the metadata and then drop it onto their
job. This will pop up a window allowing you to choose which type of
component you want to use. Holding the Control-key while dragging the
metadata will directly create an Output-component. Holding Control+Shift
will result in an Input-component.
5. Syncing columns
Occasionally, you may have to change the schema of a certain component in
the middle of development. This might affect other components in your job.
In some cases, Talend asks if you want to propagate the changes you've
made (to the other components).
You may accidentally close this window, click No or not get this message at
all, resulting in the following error: "The schema from the input
link "youroutputlink" is different from the schema defined in the component."
When this happens, you can go to the basic settings of the component that
has the error and click on Sync columns. The error should now be gone.
The moment you open the Debug run tab, you'll immediately see extra
icons in your job. These magnifying glass icons indicate that details will be
shown when you debug-run your job. The result should look something like
this:
You can Pause and Resume the run at any time. You can also add breakpoints
if you like. Do this by right-clicking on a dataflow and then selecting Show
Breakpoint Setup.
This brings you to the Breakpoint tab of the data flow you clicked on. You
can also go there by clicking on the specific flow and manually selecting
Breakpoint. Let's add a breakpoint to pause our run whenever we come
across a record with "Bloom" as last name. Firstly, make sure to check the
"Activate conditional breakpoint" option. After that, click on the plus-icon
underneath the conditions. Then select the InputColumn we want to put our
condition on, in our case this is Last_name, and add a value ("Bloom" in
this example). The default Operation is Equals, which is the one we want.
You can also pick a different Operation if you need to, but this is unnecessary for
this case.
You can add multiple breakpoints if you like. Whenever you debug-run your
job now, it will stop at a record where the Last_name is "Bloom" (if any
exist).
That's it for now. Thank you for reading!
Posted in data integration, ETL, Talend, Tips and Tricks | Tagged ETL, Talend
This blog contains some convenient tips and tricks that will make working
with the open source tool Talend for data integration a lot more efficient. This
blogpost will be especially useful for people who are just discovering this
amazing tool, yet I am sure that people who have been using it for a while
will also find it very helpful. This series of tips will be spread over multiple
blog entries so make sure to check back often for future tips!
1. Testing expressions in the tMap component
Using the tMap component, you have the possibility to test your expressions.
This way you can easily see whether or not the result is what you expected it
to be. You can also use this to determine whether or not your expression will
throw an error. Let's create an example.
We've got details of employees as input for our tMap. We would like the first
name to be shown in uppercase. First of all, go into the expression builder by
clicking the ellipsis next to your expression.
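Expressions in a tMap are plain Java, so for this example the expression could be as simple as calling toUpperCase() on the first name column (something like row1.first_name.toUpperCase(), where row1 and first_name are assumed names for the input flow and column). The sketch below shows the same logic as a standalone, null-safe method that you can run outside of Talend; the class and method names are made up for illustration.

```java
import java.util.Locale;

final class UppercaseSketch {
    // Null-safe uppercase: a missing first name stays null instead of throwing an exception.
    static String upper(String firstName) {
        return firstName == null ? null : firstName.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(upper("orlando"));
        System.out.println(upper(null));
    }
}
```

Testing with an empty or missing value is worthwhile, because an expression that works on the sample record can still fail on a null.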
There are three modes that you can choose between:
Basic
Basic mode will generate a new line for each record, with the fields separated by the Field
Separator you've chosen (see image above). When using basic mode, I
highly recommend checking the "Print header" option when working with
multiple column records or multiple outputs, purely for visibility reasons.
Table
The table mode shows the records and their headers in a table-format,
including the name of the component that generated this output (in our
case: tLogRow_1). This emphasizes the importance of properly naming
everything, especially when you have multiple components that generate
output. In this case, it would have been better to rename our component to
EMPLOYEES. Personally, I prefer this mode.
Vertical
Vertical mode will show a table for each one of your records.
The output mode you decide to use depends on what you're trying to
visualize. For example, when your goal is to show a single string, I would
recommend using the basic mode. But when you have multiple table outputs
(for example: departments, customers and employees in a single output), I'm
certain the table mode would be the best option.
Sometimes your data is spread over multiple lines, resulting in an unclear
output, as shown in the image below.
To force the output to put all the data on one single line, you can uncheck the
Wrap option. This option is located underneath your output and will enable
a horizontal scrollbar.
Do you also want to be able to get data regarding tweets using Talend, as
shown in the image above? Read my previous blogpost and find out how!
3. Resetting windows and maximizing/minimizing them
Sometimes you accidentally close a window and have a hard time finding a
way to get it back. You can very easily reset your environment by clicking on
Window > Reset Perspective.
You can see all of the views by clicking on Window > Show View >
Talend. Some of the views are not shown by default, such as Modules.
Modules can be used to import .jar-files without having to restart your studio,
which will most likely save you some time.
Lastly, because Talend is Eclipse-based, you have the possibility to maximize
and minimize windows. I personally use this function when examining the
output of a tLogRow-component including a lot of data. You can achieve this
by either double-clicking on the window or by right-clicking on it and
selecting Minimize/Maximize.
That's it for now. I hope you enjoyed reading this blog and make sure to
return soon for future blogs!
Posted in data integration, ETL, Talend, Tips and Tricks | Tagged ETL, Talend
NOTE There are some connections that don't allow you to export them as a
context. In that case you'll have to create the context group and its variables
manually, add the group/variables to your job, and use the variables in the
properties of the components of your job.
After you've clicked the "Export as context" button you'll see the Create/Edit
context group screen. Enter a name, purpose and description and click Next.
Now you'll see all the context variables that belong to this context group.
Notice that Talend has already created all the context variables that are
needed for the HR connection. If you want to change their names you can
simply click them and they become editable.
Click the Values as table tab.
In the Values as table tab you can edit the values of the context variables by
simply clicking the value and changing it. To add a new context, click the
context symbol in the upper right corner.
After the window closes, you'll see that an extra column has appeared. Enter the
connection data of the production environment in the Production column and
click Finish.
In the connection window it's possible to check the connection again, but this
time you'll be asked which context you want to check it with.
When using a connection that has been exported as a context in a job, you
have to include the context variables in order for your job to be able to run.
Go to the context tab and click the context button in the bottom left.
NOTE When using one of the newer versions, Talend proposes to add missing
context variables whenever you try to run a job. Because of this, you don't
need to add them manually as described in this example.
Select the context group that contains the context variables, in our case the
HR context group.
NOTE A context group can also be added to a job by simply selecting the
context from the repository, dragging it towards the context tab of the job,
and dropping it there.
Once you've added the context group to the job, it's possible to run the job
for both the development and production environment by selecting the
context in the dropdown menu of the Run tab.
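The idea behind a context group can be sketched in a few lines of Java: the same variable names exist in every context, and only the values differ. Everything below is a hypothetical illustration; the variable names (HR_Server, HR_Port), the host names and the class itself are assumptions, and Talend creates its own variable names when you export a connection. Inside a job you refer to such a variable as context.variableName in the component properties.

```java
import java.util.HashMap;
import java.util.Map;

final class ContextSketch {
    // One map of variables per context; the keys are identical, only the values differ.
    static Map<String, Map<String, String>> contexts() {
        Map<String, String> development = new HashMap<>();
        development.put("HR_Server", "dev-db.example.com"); // assumed value
        development.put("HR_Port", "1521");                 // assumed value

        Map<String, String> production = new HashMap<>();
        production.put("HR_Server", "prod-db.example.com"); // assumed value
        production.put("HR_Port", "1521");                  // assumed value

        Map<String, Map<String, String>> all = new HashMap<>();
        all.put("Development", development);
        all.put("Production", production);
        return all;
    }

    // Builds the endpoint for the selected context, like picking a context in the Run tab.
    static String endpoint(String contextName) {
        Map<String, String> context = contexts().get(contextName);
        if (context == null) {
            throw new IllegalArgumentException("Unknown context: " + contextName);
        }
        return context.get("HR_Server") + ":" + context.get("HR_Port");
    }

    public static void main(String[] args) {
        System.out.println(endpoint("Development"));
        System.out.println(endpoint("Production"));
    }
}
```

Because the job only ever refers to the variable names, switching environments never requires touching the components themselves.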
Posted in data integration, ETL, Talend | Tagged Contexts, ETL, Talend