Participant Guide

Talend Data Stewardship


Version 6.4
Copyright 2017 Talend Inc. All rights reserved.
Information in this document is subject to change without notice. The software described in this document is furnished under a license
agreement or nondisclosure agreement. The software may be used or copied only in accordance with the terms of those agreements.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or any means electronic
or mechanical, including photocopying and recording for any purpose other than the purchaser's personal use without the written
permission of Talend Inc.
Talend Inc.
800 Bridge Parkway, Suite 200
Redwood City, CA 94065
United States
+1 (650) 539 3200
Welcome to Talend Training

Congratulations on choosing a Talend training course.

Working through the course


You will develop your skills by working through use cases and practice exercises using live software. Completing the exercises is
critical to learning!
If you are following a self-paced, on-demand training (ODT) module, and you need an answer to proceed with a particular exercise,
use the help suggestions on your image desktop. If you can’t access your image, contact customercare@talend.com.

Exploring
You will be working in actual Talend software, not a simulation. We hope you have fun and get lots of practice using the software!
However, if you work on tasks beyond the scope of the training, you could run out of time with the environment, or you could mess up
data or Jobs needed for subsequent exercises. We suggest finishing the course first, and if you have remaining time, explore as you
wish. Keep in mind that our technical support team can’t assist with your exploring beyond the course materials.

For more information


Talend product documentation (help.talend.com)
Talend Community (community.talend.com)

Sharing
This course is provided for your personal use under an agreement with Talend. You may not take screenshots or redistribute the
content or software.

CONTENTS
LESSON 1 Data Stewardship in Context
Concepts
LESSON 2 Getting Started
Concepts
Overview
Exploring the Environment
Creating Users in TAC
Connecting to Talend Data Stewardship
Review
LESSON 3 Creating a campaign
Concepts
Overview
Creating a Data Model
Creating a Campaign
Review
LESSON 4 Creating and Handling tasks
Concepts
Overview
Creating Tasks
Assigning and Resolving Tasks
Review
LESSON 5 Grouping Campaign
Concepts
Overview
Creating a Data Model
Creating a Grouping Campaign
Creating Grouping Tasks
Handling Grouping Tasks
Review
LESSON 6 Merging Campaign
Concepts
Overview
Creating a Merging Campaign
Creating Merging Tasks
Resolving Merging Tasks
Querying Tasks
Review
LESSON 7 Arbitration Campaigns
Concepts
Overview
Creating an Arbitration Campaign
Creating Arbitration Tasks
Handling Arbitration Tasks
Review

LESSON 1
Data Stewardship in Context
In this lesson, you get an overview of Talend Data Stewardship.
This chapter discusses:

Concepts
Concepts

Data stewardship is about defining and maintaining data models, documenting data, cleansing data, and defining rules and policies.
It enables the implementation of well-defined data governance processes covering several activities, including monitoring,
reconciliation, refining, deduplication, cleansing, grouping, and aggregation, to help deliver quality data to applications and
end users. In addition to improved data integrity, data stewardship helps ensure that data is used consistently throughout the
organization, and it reduces data ambiguity through metadata and semantics. Talend Data Stewardship can trigger validation
workflows for tasks that should be double-checked. Because it is easy to use through a guided experience and is workflow driven,
anyone can participate in the data curation efforts, with clear responsibilities and efficient tools for execution.

Because Data Stewardship is fully integrated with the Talend platform, it can be associated with any data flow and integration
style that Talend can manage, so you can embed governance and stewardship in data integration flows, MDM initiatives, and
matching processes.

LESSON 2
Getting Started
This chapter discusses:

Concepts
Overview
Exploring the Environment
Creating Users in TAC
Connecting to Talend Data Stewardship
Review
Concepts

A critical component of data governance, data stewardship is the process of managing the data life cycle from curation to
retirement.

Campaign owners can define different campaigns, including arbitration, resolution, merging, and grouping; engage data stewards
who contribute to each campaign; define data structure used by campaigns; refer to Talend Jobs to load tasks into campaigns;
retrieve tasks from campaigns; and assign tasks in campaigns to data stewards.

Data stewards can explore data that relates to their tasks, resolve tasks on a one-to-one basis or for a set of records, delegate
tasks to colleagues, and monitor and audit stewardship campaigns and error resolution.

Campaign owners can access more features from the left menu than data stewards.

Overview
Talend Data Stewardship is a comprehensive tool for configuring and managing data assets. Managing these assets may entail
performing certain actions, such as accepting, rejecting, arbitrating, reviewing, merging, and cleansing. Talend Data Stewardship
helps organize data interaction in scenarios that require human intervention. Then, before writing data to a target application, a data
steward can use the interface to resolve issues in listed tasks.
The Data Stewardship solution is a modular suite composed of several applications working together to provide a distributed
development environment.
This diagram shows the interactions between modules:

There are several functional blocks:

The Clients block includes a web browser and Talend Studio.
    Using a web browser, campaign owners and data stewards can access the Data Stewardship web application,
    where they can manage campaigns and tasks. They can also add, remove, and edit semantic types.
    In Talend Studio, campaign owners can create Jobs to load, read, and delete tasks from campaigns created in the
    web application.
The Servers block includes the Talend Data Stewardship application server, connected to Talend Administration Center (TAC)
and optionally to the Talend Dictionary Service server. This block also includes a Kafka server used for messaging.
The Talend Identity and Access Management (IAM) Service is used to enable Single Sign-On (SSO).
    TAC allows administrators to manage licenses, users, and roles. Users can be assigned one or more predefined
    roles that have specific rights within Data Stewardship.
    Talend Dictionary Service allows campaign owners to manage the semantic categories used on data in the web
    application. It communicates with the Kafka server.
    The Kafka server is used to exchange messages on every event that occurs in the web application, such as
    modifying a data model, using a data model in a campaign, auditing actions performed on tasks, and managing
    semantic categories using Talend Dictionary Service.
    IAM ensures SSO authentication.
The Databases block contains the databases used with TAC and two MongoDB databases.
    The Administration database is used to manage user accounts and rights, the Audit database is used to evaluate
    aspects of Jobs implemented in Studio, and the Monitoring database is used to monitor execution of technical
    processes and service calls.
    One MongoDB database is used to store all campaigns and tasks. Nothing is saved directly on your computer.
    The other MongoDB database is used to store semantic types for the dictionary service.
You can install the different modules on separate computers or install multiple applications on the same computer.

In this course, you use a complete Talend environment, hosted on a single virtual machine (VM) that contains all the items you need.
This environment is similar to what you would find in a new Talend installation, with default parameters and configuration.
In this lesson, you explore the Data Stewardship server configuration, then connect to TAC and create several users.
Finally, you start the Data Stewardship server and test your connection to the Data Stewardship web UI.

Objectives
After completing this lesson, you will be able to:
Describe the Data Stewardship architecture
Create Data Stewardship users in TAC
Connect to the Data Stewardship web UI

Next step
You are ready to start exploring the environment.

Exploring the Environment

Overview

In this section, you open the configuration file for the Talend Data Stewardship server and check the configuration of all
related components. Then you confirm that all the application services you need are running.

Explore the Data Stewardship configuration file


1. OPEN THE DATA STEWARDSHIP CONFIGURATION FILE
In C:\Talend\6.4.1\tds\apache-tomcat\conf, open data-stewardship.properties.
a. Using Windows Explorer, navigate to the C:\Talend\6.4.1\tds\apache-tomcat\conf folder.

b. Right-click the data-stewardship.properties file and select Edit with Notepad++.

2. EXPLORE THE CONFIGURATION FILE


Notice the properties configured for Talend Data Stewardship.

a. The Talend IDP section describes the Single Sign-On properties for Talend Data Stewardship.
b. MongoDB settings are also configured in this configuration file. MongoDB is a prerequisite for the Data Stewardship
server, as it is used to store metadata for campaigns and tasks. In your training environment, MongoDB is installed
and running, and a MongoDB user has been created with required credentials.
c. Kafka settings are also configured in this file, as they ensure service messaging in the Data Stewardship server. Kafka
topics are also used for the messaging between Data Stewardship and the dictionary service.

d. The specific dictionary service configuration appears at the end of this file. This configuration is required in order to
use DQ-specific semantic types.
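
If you want to review the same settings outside a text editor, a short script can group the property keys by their prefix. This is only a minimal sketch. It assumes the file uses plain key=value lines, and it ignores other properties-file features such as line continuations. The sample keys and values below are invented for illustration and are not the actual Talend property names.

```python
from collections import defaultdict


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def group_by_prefix(props):
    """Group property keys by the text before the first dot."""
    groups = defaultdict(dict)
    for key, value in props.items():
        prefix = key.split(".", 1)[0]
        groups[prefix][key] = value
    return dict(groups)


# Hypothetical sample content; the real property names will differ.
SAMPLE = """
# MongoDB (hypothetical keys)
mongodb.host=localhost
mongodb.port=27017
# Kafka (hypothetical keys)
kafka.broker=localhost:9092
"""

props = parse_properties(SAMPLE)
groups = group_by_prefix(props)
for prefix, entries in sorted(groups.items()):
    print(prefix, "->", sorted(entries))
```

To try it on the training image, you could read the text of data-stewardship.properties from the conf folder and pass it to parse_properties in place of SAMPLE.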

Confirm that services are running


1. OPEN WINDOWS SERVICES
On the Windows taskbar, click the Services button.

2. CONFIRM THAT TALEND ZOOKEEPER IS RUNNING
Search for Talend Zookeeper (a prerequisite for Talend Kafka service) and verify that the status is Running.

3. CONFIRM THAT TALEND KAFKA IS RUNNING
Search for Talend Kafka and verify that the status is Running.

4. CONFIRM THAT TALEND MONGODB IS RUNNING

Search for Talend MongoDB and verify that the status is Running.

5. CONFIRM THAT TAC IS RUNNING
Search for Talend Administration Center 6.4.1.
If the service is not running, click the Start button and wait until the status is Running.

6. CONFIRM THAT DICTIONARY SERVICE IS RUNNING

Search for Talend Dictionary Service 6.4.1 and verify that the status is Running.
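
The checks above amount to a simple checklist. As a rough sketch, the following code lists the five services named in this section and reports which ones still need to be started. The status snapshot is hypothetical sample data typed in by hand. The sketch does not query Windows itself.

```python
# Service display names as listed in this section, in the order they are checked.
REQUIRED_SERVICES = [
    "Talend Zookeeper",  # prerequisite for the Talend Kafka service
    "Talend Kafka",
    "Talend MongoDB",
    "Talend Administration Center 6.4.1",
    "Talend Dictionary Service 6.4.1",
]


def services_to_start(statuses):
    """Return required services whose status is anything other than Running."""
    return [name for name in REQUIRED_SERVICES if statuses.get(name) != "Running"]


# Hypothetical snapshot copied by hand from the Services window.
snapshot = {
    "Talend Zookeeper": "Running",
    "Talend Kafka": "Running",
    "Talend MongoDB": "Running",
    "Talend Administration Center 6.4.1": "Stopped",
}

print(services_to_start(snapshot))
```

A service that is missing from the snapshot is treated the same way as a stopped one, so an incomplete check never passes silently.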

Next step
You are ready to connect to TAC and create users.

Creating Users in TAC

Overview

In TAC, administrators create users and assign them predefined roles.

The core elements of Talend Data Stewardship are campaigns and tasks.
Data Stewardship has two predefined roles: campaign owner and data steward.
    Campaign owners can define different types of campaigns, generate tasks, and assign tasks to data stewards.
    Data stewards mainly handle tasks created by campaign owners.
In this section, you will connect to TAC and define the following users:
    One technical campaign owner, who mainly works on developing Talend Studio Jobs to generate tasks for campaigns
    One business campaign owner, who handles the business aspects of tasks in the Data Stewardship web UI and
    also has data steward permissions
    Two data stewards, who are in charge of completing tasks generated by campaign owners

Connect to TAC
1. OPEN THE TAC URL
On the Windows Start menu, go to All Programs>Talend 6.4.1.
Click Talend Administration Center.

2. FILL IN THE ADMIN USER CREDENTIALS


The TAC log-in page appears.
Log in with the default account credentials:
Login: security@company.com
Password: admin
Note: TAC user names must be in email address format.

3. LOG IN
Click the Login button.
You are logged in with administrator privileges.
The WELCOME page appears:

Create campaign owners in TAC
1. OPEN THE USERS MENU
On the Menu, in Settings, click Users.

2. ADD A USER

On the Users tab, click Add.

3. CONFIGURE A CAMPAIGN OWNER USER


The first campaign owner user to define is the technical campaign owner.
Assign this user the campaign owner role.
a. Enter the following credentials:
Login: towner@company.com
First Name: Technical
Last Name: Owner
Password: talend
b. Select the Data Stewardship User check box.
c. For Type, select No Project Access.
d. For the Data Stewardship role, select Campaign Owner and click the Validate button.
Note: The Data Stewardship Role option appears only after selecting the Data Stewardship User check box.

e. Confirm that the Active check box is selected.

f. Verify that the user details appear as in the screenshot:

4. SAVE THE USER

To add the new user to the list, click Save.

5. ADD A SECOND USER


On the Users tab, click Add.

6. CONFIGURE THE BUSINESS CAMPAIGN OWNER USER


The business campaign owner defines data models and campaigns and validates business tasks.
Assign this user both the campaign owner and data steward roles.
The business campaign owner also ensures integration between Data Stewardship and the dictionary service, so this user
needs operation manager rights on Data Quality projects.
a. Enter the following credentials:
Login: bowner@company.com
First Name: Business
Last Name: Owner
Password: talend
b. For Type, select Data Management.
c. In the Role drop-down menu, select Operation manager and click Validate.

d. Select the Data Stewardship User check box.
e. For Data Stewardship Role, select Data Steward and Campaign Owner, then click Validate.

f. Confirm that the Active check box is selected.

g. Verify that the user details appear as in the screenshot:

7. SAVE THE USER

To add the user to the list, click Save.

Create data stewards in TAC


1. ADD A DATA STEWARD BUSINESS USER
Add a user who has only data steward permissions.
a. On the Users tab, click Add.
b. For the data steward user, enter the following credentials:
Login: jclark@company.com
First Name: James
Last Name: Clark
Password: talend
c. Select the Data Stewardship User check box.
d. For Type, select No Project Access.
e. For the Data Stewardship Role, select Data Steward.
f. Make sure the Active check box is selected.

g. Verify that the user details appear as follows:

h. Click Save.
2. ADD A SECOND DATA STEWARD USER
Add another user who has only data steward permissions.
a. On the Users tab, click Add.
b. For the data steward user, enter the following credentials:
Login: dsmith@company.com
First Name: Dave
Last Name: Smith
Password: talend
c. Select the Data Stewardship User check box.
d. For Type, select No Project Access.
e. For Data Stewardship Role, select Data Steward.
f. Make sure the Active check box is selected.
g. Verify that the user details appear as follows:

h. Click Save.
3. VERIFY THE LIST OF USERS
Confirm that the user list appears as follows:
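
As a cross-check, the four accounts created in this section can be summarized in a small data structure. This is a sketch for review purposes only. The class and field names are illustrative and are not part of TAC, and the email-shape pattern is a loose approximation, because the exact validation rule that TAC applies to logins is not described in this guide.

```python
import re
from dataclasses import dataclass

# Loose email-shaped check (an assumption, not TAC's actual rule).
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass(frozen=True)
class TacUser:
    login: str
    first_name: str
    last_name: str
    user_type: str
    stewardship_roles: tuple


# Values taken from the steps in this section.
USERS = [
    TacUser("towner@company.com", "Technical", "Owner",
            "No Project Access", ("Campaign Owner",)),
    TacUser("bowner@company.com", "Business", "Owner",
            "Data Management", ("Data Steward", "Campaign Owner")),
    TacUser("jclark@company.com", "James", "Clark",
            "No Project Access", ("Data Steward",)),
    TacUser("dsmith@company.com", "Dave", "Smith",
            "No Project Access", ("Data Steward",)),
]


def invalid_logins(users):
    """Return logins that do not look like an email address."""
    return [u.login for u in users if not EMAIL_SHAPE.match(u.login)]


owners = [u.login for u in USERS if "Campaign Owner" in u.stewardship_roles]
stewards = [u.login for u in USERS if "Data Steward" in u.stewardship_roles]

print("Campaign owners:", owners)
print("Data stewards:", stewards)
print("Logins to fix:", invalid_logins(USERS))
```

Note that the business campaign owner appears in both lists, which matches the two roles assigned to that account.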

Next step
Now you can start the Data Stewardship server and test your connection.

Connecting to Talend Data Stewardship

Overview

You are ready to start the Talend Data Stewardship server and test your connection as a campaign owner and data steward.
You can also explore the menus available to those in different roles.

Start Data Stewardship


1. OPEN WINDOWS SERVICES
On the Windows taskbar, click the Services button.

2. VERIFY THAT TALEND DATA STEWARDSHIP IS RUNNING
If the service is not running, click the Start button and wait until the status is Running.

Check the connection


1. OPEN THE DATA STEWARDSHIP WEB UI
In a web browser, enter http://localhost:19999

2. CONNECT AS A TECHNICAL CAMPAIGN OWNER
Connect to Data Stewardship as towner@company.com.
a. Enter these credentials:
Email: towner@company.com
Password: talend

b. Click the LOG IN button.


This screen appears:

c. Notice that two menus are available for a user with campaign owner permissions: Campaigns and Data Models.
3. LOG OUT OF THE DATA STEWARDSHIP WEB UI
On the menu in the upper right corner, click the Logout button.

4. CONNECT AS A BUSINESS CAMPAIGN OWNER


Connect to Data Stewardship as bowner@company.com.
a. Enter these credentials:
Email: bowner@company.com
Password: talend

b. Click LOG IN.


This screen appears:

c. Notice that four menus are available to a user with both campaign owner and data steward permissions. Campaigns,
Data Models, and Semantic Types are available only to campaign owners, while the Tasks menu is available to data
stewards.
5. LOG OUT OF THE WEB UI
On the menu in the upper right corner, click Logout.

6. CONNECT AS A DATA STEWARD


Connect to Data Stewardship as James Clark.
a. Enter these credentials:
Email: jclark@company.com
Password: talend

b. Click SIGN IN.
This screen appears:

c. Notice that just one menu is available for a user with only data steward permissions.
7. LOG OUT OF THE WEB UI

On the menu in the upper right corner, click Logout.

Next step
You have almost finished this section. Time for a quick review.

Review
You began this lesson by learning about the architecture of the Talend Data Stewardship application, and you explored the
configuration file. Then you connected to TAC and created two campaign owner users and two data steward users. You learned that
Data Stewardship user roles and permissions are handled in TAC.
Finally, you started the Data Stewardship server and tested your connection using the web UI.

Additional information
Talend documentation:
About Talend Data Stewardship

LESSON 3
Creating a campaign
This chapter discusses:

Concepts
Overview
Creating a Data Model
Creating a Campaign
Review
Concepts

A campaign contains tasks of a homogeneous type. The task type defines what the data steward must do to resolve the campaign
tasks. A campaign can have one or several owners. Data Stewardship supports several task types, depending on campaign type:
Resolution tasks allow assignees to correct one or several fields of the data record. The outcome of a resolution task is the
curated record corrected by data stewards.
Merging tasks allow assignees to merge several potential duplicate source records into a single (“golden”) record. The outcome
of a merging task is the golden record produced by data stewards.
Arbitration tasks can be assigned to a label chosen from a list of arbitration choices defined by the campaign owner. The
outcome of an arbitration task is the arbitration choice made by data stewards.
Grouping campaigns define a list of possible arbitration choices for grouping records. The outcome of a grouping task is the
choice made by data stewards on the group of records.

The roles of the actors in a campaign can be created and defined.

A validation step can be used if needed in the workflow.

In the exercises for this lesson, you will build a data model and create a resolution campaign.

Overview
To help deliver high-quality data to applications and end users, Talend Data Stewardship enables implementation of well-defined
data governance processes that cover several activities, including monitoring, reconciliation, refining, deduplication, grouping,
cleansing, and aggregation.
Campaign owners and data stewards use Data Stewardship capabilities to perform data stewardship and management tasks.
The campaign is the core concept of Data Stewardship. The campaign owner determines which configuration assets are required.
What do the tasks cover? (Examples: data structure and validation constraints)
What must data stewards do to resolve campaign tasks?
Which data stewards work on campaign tasks?
How do data stewards collaborate to resolve campaign tasks?
A campaign has one or several owners, and it contains tasks of a homogeneous (similar) type. The task type defines what the data
steward must do to resolve campaign tasks.
Based on the campaign type, Data Stewardship supports several task types:
Resolution: Tasks allow assignees to correct one or several fields of the data record. The outcome of a resolution task is
the curated record corrected by data stewards.
Merging: Tasks allow assignees to merge several potential duplicate source records into a single "golden" record. The
outcome of a merging task is the golden record produced by data stewards.
Arbitration: Tasks can be assigned to a label chosen from a list of arbitration options defined by the campaign owner. The
outcome of an arbitration task is the result of choices made by data stewards.
Grouping: Tasks allow assignees to validate (or not) the correlation of pairs or groups of records. The outcome of a
grouping task is the choice made by data stewards on the group of records.
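
One way to keep the four task types straight is to pair each campaign type with the outcome described above. The following sketch does that with a small enumeration. The names are illustrative only and do not correspond to any Talend API.

```python
from enum import Enum


class CampaignType(Enum):
    RESOLUTION = "resolution"
    MERGING = "merging"
    ARBITRATION = "arbitration"
    GROUPING = "grouping"


# Outcome of a task for each campaign type, as described in this overview.
OUTCOMES = {
    CampaignType.RESOLUTION: "curated record corrected by data stewards",
    CampaignType.MERGING: "golden record produced by data stewards",
    CampaignType.ARBITRATION: "arbitration choice made by data stewards",
    CampaignType.GROUPING: "choice made by data stewards on the group of records",
}


def describe(campaign_type):
    """Return a one-line summary of what a task of this type produces."""
    return f"{campaign_type.value} task -> {OUTCOMES[campaign_type]}"


for campaign_type in CampaignType:
    print(describe(campaign_type))
```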
Consider a scenario in which a company needs to improve the quality of its library product items.
A resolution campaign allows you to validate product information against a data model, which defines the structure of tasks.
In addition to defining the data model, you will add new semantic types.
Here is a diagram of how it works:

Objectives
After completing this lesson, you will be able to:
Create a data model
Add semantic types
Define a resolution campaign

Next step
You are ready to create a data model.

Creating a Data Model

Overview

In this course, you will work on data models that store movies, books, video games, and records with descriptive details
and prices.
In this section, the first data model you will create defines attributes that have data types and specific constraints.
In Talend Data Stewardship (TDS), you can use the standard data types (for example, integer and string), as well as a set of
semantic types provided by the dictionary service. You will also learn how to add your own semantic types to the dictionary
service so you can reuse them in data models as needed.

Create a data model


1. CONNECT TO THE DATA STEWARDSHIP WEB UI
Connect as a business owner by entering these credentials:
Email: bowner@company.com
Password: talend
Click SIGN IN.
The home page appears.

2. ADD A DATA MODEL


Add a data model for library items.
a. On the menu, click DATA MODELS.

b. Click the ADD DATA MODEL button.

c. The data model form appears.

3. DEFINE THE DATA MODEL


In the NAME text box, enter Library Products, and in the DESCRIPTION box, enter a short description.

4. DEFINE THE FIRST ATTRIBUTE


By default, below the description, the data model contains an attribute.
a. Designate the Identifier as CODE and the Name as Code.
The identifier corresponds to the technical field while the name corresponds to the displayed field name.

b. Change the data type to Integer.
c. The ALLOW EMPTY VALUES slide button indicates whether the library item identifier is optional. In this case the
identifier is mandatory, so deactivate the button.
d. Observe that the attribute name and type appear on the left.

5. ADD AN ITEM NAME ATTRIBUTE


To add a second attribute, on the menu on the left, click the ADD ATTRIBUTE button.
The attribute details form appears in the main frame.
Define it like this:
Identifier: NAME
Name: Name
Data type: Text
Deactivate the ALLOW EMPTY VALUES slide button.

6. ADD AN OPTIONAL AUTHOR ATTRIBUTE


Using the same method, add an attribute used for storing the name of the author/director for the library item.

Define it like this:
Identifier: AUTHOR_DIRECTOR
Name: Author/Director
Data type: Text
To make the field optional, leave the ALLOW EMPTY VALUES slide button activated.

7. ADD AN OPTIONAL ITEM CATEGORY ATTRIBUTE


Add an attribute for storing the item category with certain allowed values.
Define it like this:
Identifier: CATEGORY
Name: Category
Data type: Text
Activate the Define a list of values slide button and define three possible values: Books, Movies, and Video
Games.
Leave the ALLOW EMPTY VALUES slide button activated.

8. ADD AN OPTIONAL ITEM SUBCATEGORY ATTRIBUTE
Add an attribute for storing the library item subcategory (for example, science fiction or drama).
Define it like this:
Identifier: SUBCATEGORY
Name: Sub-Category
Data type: Text
Leave the ALLOW EMPTY VALUES slide button activated.

9. ADD AN OPTIONAL ITEM FORMAT ATTRIBUTE
Add an attribute for storing the library item format (for example, Blu-ray or DVD).
Define it like this:
Identifier: FORMAT
Name: Format
Data type: Text
Leave the ALLOW EMPTY VALUES slide button activated.

10. ADD AN OPTIONAL PRICE ATTRIBUTE
Add an attribute for storing the price of the library item. The data type for this field is a decimal number with two digits after the decimal point.
Define it like this:
Identifier: PRICE
Name: Price
Data type: Decimal
Leave the ALLOW EMPTY VALUES slide button activated.

11. ADD AN OPTIONAL POPULARITY ATTRIBUTE
Add an attribute for storing the popularity of the library item. The data type for this attribute is an integer between 1 and 5.
Define it like this:
Identifier: POPULARITY
Name: Popularity
Data type: Integer
Designate the Minimum as 1 and the Maximum as 5.
Leave the ALLOW EMPTY VALUES slide button activated.

12. ADD AN OPTIONAL AVAILABILITY ATTRIBUTE
Add an attribute for storing the availability of each library item. The data type for this field is boolean; only values true and
false are accepted.
Define it like this:
Identifier: AVAILABILITY
Name: Availability
Data type: Boolean
Leave the ALLOW EMPTY VALUES slide button activated.

13. ADD AN OPTIONAL DATA SOURCE ATTRIBUTE
Add an attribute for storing the data source of each library item. Assume that library data comes from either SAP or the
legacy product catalog.
Define it like this:
Identifier: SOURCE
Name: Source
Data type: Text
Activate the DEFINE A LIST OF VALUES slide button and define two possible values: SAP and
LEGACY PRODUCT CATALOG.
Leave the ALLOW EMPTY VALUES slide button activated.

14. VERIFY THE DATA MODEL STRUCTURE
Confirm that the data model looks like this:

15. SAVE THE DATA MODEL
Save the data model.
a. In the lower right corner, click the ADD DATA MODEL button.

b. The data model appears on the list of available data models.
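
To see how the attribute settings work together, the ten attributes defined in the steps above can be written out as a small data structure with the same constraints. This is a sketch only. It models the ALLOW EMPTY VALUES setting, the lists of values, and the minimum and maximum, but it does not check data types, and it does not reflect how Talend Data Stewardship stores or validates a data model internally. The class, function, and record values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Attribute:
    identifier: str                       # technical field
    name: str                             # displayed field name
    data_type: str                        # informational only in this sketch
    allow_empty: bool = True
    allowed_values: Tuple[str, ...] = ()
    minimum: Optional[int] = None
    maximum: Optional[int] = None


LIBRARY_PRODUCTS = [
    Attribute("CODE", "Code", "Integer", allow_empty=False),
    Attribute("NAME", "Name", "Text", allow_empty=False),
    Attribute("AUTHOR_DIRECTOR", "Author/Director", "Text"),
    Attribute("CATEGORY", "Category", "Text",
              allowed_values=("Books", "Movies", "Video Games")),
    Attribute("SUBCATEGORY", "Sub-Category", "Text"),
    Attribute("FORMAT", "Format", "Text"),
    Attribute("PRICE", "Price", "Decimal"),
    Attribute("POPULARITY", "Popularity", "Integer", minimum=1, maximum=5),
    Attribute("AVAILABILITY", "Availability", "Boolean"),
    Attribute("SOURCE", "Source", "Text",
              allowed_values=("SAP", "LEGACY PRODUCT CATALOG")),
]


def check_value(attr, value):
    """Return an error message, or None if the value satisfies the constraints."""
    if value is None or value == "":
        return None if attr.allow_empty else f"{attr.identifier}: value is required"
    if attr.allowed_values and value not in attr.allowed_values:
        return f"{attr.identifier}: {value!r} is not an allowed value"
    if attr.minimum is not None and value < attr.minimum:
        return f"{attr.identifier}: {value} is below the minimum {attr.minimum}"
    if attr.maximum is not None and value > attr.maximum:
        return f"{attr.identifier}: {value} is above the maximum {attr.maximum}"
    return None


def check_record(record):
    """Check one record (a dict keyed by identifier) against the data model."""
    errors = []
    for attr in LIBRARY_PRODUCTS:
        message = check_value(attr, record.get(attr.identifier))
        if message:
            errors.append(message)
    return errors


# Made-up records for illustration.
good = {"CODE": 1, "NAME": "Dune", "CATEGORY": "Books", "POPULARITY": 5}
bad = {"NAME": "", "CATEGORY": "Music", "POPULARITY": 9}

print(check_record(good))
for message in check_record(bad):
    print(message)
```

The second record fails four checks: the mandatory CODE and NAME values are missing, Music is not in the list of values for CATEGORY, and 9 is above the maximum for POPULARITY.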

Add a semantic category to the dictionary service
As you may have noticed while creating the data model, many predefined data types are available. In addition to the standard ones,
you can choose predefined semantic categories from the dictionary service.

Now you will add a semantic category to the list of predefined data types. This category is based on a regular expression that stores
the publishing years of library items.
Assume that the publishing year allows:
Years (1 to 4 digits)
An interval of years (1 to 4 digits-1 to 4 digits)
The associated regular expression would be ^([0-9]{1,4})(-[0-9]{1,4})?$
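
Before you enter the pattern in the web UI, you can try it against a few sample values. The sketch below uses Python's re module, on the assumption that this simple pattern behaves the same way in the engine used by the dictionary service. The sample values are made up.

```python
import re

# Pattern from this section: one year, or two years separated by a hyphen.
PUBLISH_YEAR = re.compile(r"^([0-9]{1,4})(-[0-9]{1,4})?$")


def is_publish_year(value):
    """Return True if the whole value matches the PUBLISH YEAR pattern."""
    return PUBLISH_YEAR.fullmatch(value) is not None


for sample in ["1999", "1984-1986", "85", "19999", "1999-", "c. 1900", "2000-1990"]:
    print(f"{sample!r}: {is_publish_year(sample)}")
```

The pattern checks only the shape of the value. An interval such as 2000-1990, where the second year is earlier than the first, still matches, so catching that kind of error remains a job for the data steward.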
1. OPEN THE SEMANTIC TYPES MENU
On the menu, click Semantic Types.

2. ADD A NEW SEMANTIC TYPE


The list of existing semantic types is displayed.

a. Select the Add Semantic Type button.

b. The Add Semantic Type form appears.

c. Define it like this:


Name: PUBLISH YEAR
Description: Publishing Year(s)
Leave the Type field with the Regular expression value.
Leave the Use for validation slide button activated.
Leave the Content field with the Any character value.
Validation pattern: ^([0-9]{1,4})(-[0-9]{1,4})?$

d. Verify that the semantic type details appear as in the screenshot, then click the Save and publish button.

Use semantic category data types in a data model


You are ready to enrich the Library Products data model with two new attributes:
One that stores the item language: you will use a predefined semantic category
One that stores the item publishing year(s): you will use the PUBLISH YEAR semantic type that you created
1. REOPEN THE DATA MODEL DEFINITION
In Data Stewardship, go to the data model definition.
If you logged out, log back in as the business owner. Click DATA MODELS and the Library Products data model.

2. ADD AN OPTIONAL LANGUAGE ATTRIBUTE


Add an optional attribute for storing the language of the library item.
Define it like this:
Identifier: LANGUAGE
Name: Language
Data type: Language
Leave the ALLOW EMPTY VALUES slide button activated.

3. ADD AN OPTIONAL YEAR ATTRIBUTE
Add an optional attribute for storing the publishing year of library items.
Define it like this:
Identifier: YEAR
Name: Year
Data type: PUBLISH YEAR
Leave the ALLOW EMPTY VALUES slide button activated.

4. REORDER ELEMENTS
Move the Year and Language attributes ahead of the Category attribute.
a. You can change the order of fields by dragging and dropping or using the arrows in front of the attribute. The arrows
appear only while hovering your mouse over the attribute.

b. The field order looks like this:

5. SAVE THE DATA MODEL
To save the data model, in the lower right corner, click the EDIT DATA MODEL button.

Next step
Now you will create a campaign based on the defined data model.

Creating a Campaign

Overview

Consider a scenario in which the company product catalog has product records with some values that need to be
provided or corrected. You must approve product information against a data model to make sure the catalog contains the
most complete and accurate information. Once this is done, the clean data can be uploaded to any target system, for
example, a Talend Master Data Management (MDM) system.
In this section, you will create a resolution campaign to correct library product items.

A campaign contains:
General characteristics
Roles
A data model
A workflow
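
The four parts of a campaign can be pictured as one record. The sketch below fills that record in with the values used in the steps of this section. The class and field names are illustrative and are not a Talend API, and the workflow is left empty because its states are not detailed here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Role:
    name: str
    stewards: List[str]


@dataclass
class Campaign:
    # General characteristics
    name: str
    description: str
    campaign_type: str
    owners: List[str]
    # Roles
    roles: List[Role]
    # Data model, referenced by name
    data_model: str
    # Workflow; left empty in this sketch
    workflow: List[str] = field(default_factory=list)


campaign = Campaign(
    name="LIBRARY PRODUCTS RESOLUTION",
    description="check the records in the library catalog and cure anomalies",
    campaign_type="RESOLUTION",
    owners=["bowner@company.com", "towner@company.com"],
    roles=[
        Role("DATA ANALYST", ["jclark@company.com", "dsmith@company.com"]),
        Role("DATA VALIDATOR", ["bowner@company.com"]),
    ],
    data_model="Library Products",
)


def stewards_in(campaign):
    """Return every steward who holds at least one role, without duplicates."""
    seen = []
    for role in campaign.roles:
        for steward in role.stewards:
            if steward not in seen:
                seen.append(steward)
    return seen


print(stewards_in(campaign))
```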

Create a resolution campaign


1. ADD A CAMPAIGN
a. Still logged in as the business owner, click Campaigns.

b. On the Campaigns page, click the ADD CAMPAIGN button.

c. The campaign form appears.

2. DEFINE THE GENERAL CHARACTERISTICS
Designate the name, description, and type of campaign.
a. Enter a name and description for the campaign:
Name: LIBRARY PRODUCTS RESOLUTION
Description: check the records in the library catalog and cure anomalies
b. For TYPE, click the RESOLUTION button.
The form is updated.

3. DEFINE THE CAMPAIGN OWNERS


By default, the business owner—the person creating the campaign—is assigned to it. You can add a second campaign
owner with rights to modify the campaign and create associated tasks. You can also add a technical campaign owner who
adds tasks to the campaign in Talend Studio.
Click in the Add a campaign owner text box. On the list, select towner@company.com.

4. DEFINE ROLES IN THE RESOLUTION CAMPAIGN


As a campaign owner, you can define roles to decide which data stewards are part of the campaign and which can access
various tasks during the campaign workflow.

a. Next to STEWARDS, click the edit icon.

b. Change the role name to DATA ANALYST (uppercase) and save by clicking the check mark.

c. Click in the Add a steward text box and select the data stewards you created earlier, jclark@company.com and
dsmith@company.com.

d. Click ADD A ROLE to add a second role named DATA VALIDATOR (uppercase). To save, click the check mark.

e. Click in the Add a steward text box and select the email address of the business owner, who will act as task validator.

5. DEFINE THE DATA MODEL FOR THE RESOLUTION CAMPAIGN


The next step is to define the structure of the data that needs to be corrected.

a. Under Model, click in the Select a data model text box and select the Library Products data model.

b. The form is updated with the data model (scroll down if necessary). You can now define access rights for the two
roles, which, by default, have editing rights in all fields.

6. RESTRICT ACCESS TO THE DATA MODEL


Here are some reasons you might mark an attribute "read only" or hide it:

Read-only access is useful if a data steward needs to access information to make a decision but is not allowed to
change the value. For instance, this could happen with foreign keys—unique IDs of other elements linked to the
entity that the steward is viewing. Or you may have data that must not be changed (for instance, the customer support level).
Hiding an attribute is useful if the information is sensitive (such as financial information) and should not be seen by
the data steward, or if it is irrelevant to the data steward (for example, ID numbers) but still needs to be propagated
as part of the task.
Restrict access as follows:
For a data analyst, allow read-only access to the Code attribute
For a data analyst, hide the price
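The effect of these two restrictions can be pictured with a minimal sketch. This is not the Talend API; the rule table, the function names, and the PRICE identifier are assumptions made for illustration only.

```python
# Hypothetical model of per-role attribute access (not the Talend API).
# Attribute identifiers such as PRICE are assumed for illustration.
ACCESS_RULES = {
    "DATA ANALYST": {"CODE": "read-only", "PRICE": "hidden"},
    "DATA VALIDATOR": {},  # no restriction: every attribute stays editable
}


def visible_fields(role, record):
    """Return the record without the attributes hidden from this role."""
    rules = ACCESS_RULES.get(role, {})
    return {name: value for name, value in record.items()
            if rules.get(name) != "hidden"}


def can_edit(role, attribute):
    """An attribute is editable unless it is marked read-only or hidden."""
    return ACCESS_RULES.get(role, {}).get(attribute, "editable") == "editable"
```

With these rules, a data analyst sees the Code value but cannot change it, and never sees the price, while a data validator keeps the default editing rights.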

7. DEFINE A WORKFLOW IN THE RESOLUTION CAMPAIGN


This corresponds to the definition of steps necessary to resolve the campaign tasks.

You must also define which data stewards can access which task state or transition as part of the campaign workflow.
a. To add a validation step to the workflow necessary to resolve the tasks in the campaign, click the switch next to
Enable validation step.
Notice that a step is added to the workflow. This type of workflow is used when there is a need to validate tasks
handled by data stewards.

b. This campaign uses the simple default workflow, so click the Enable validation step switch again to deactivate it.

c. Below NEW, click in the Add a role text box and select DATA ANALYST. This grants the data analyst access to the
campaign tasks in the New state.
d. Below RESOLVED, click in the Add a role text box and select DATA VALIDATOR.
This grants the data validator access to the campaign tasks in the Resolved state.
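As a rough sketch of what this step configures, the workflow can be thought of as a table that maps each task state to the roles allowed to work on it. The structure and names below are hypothetical and only mirror the two states and two roles used in this exercise.

```python
# Hypothetical model of the simple two-state workflow (not the Talend API).
STATE_ROLES = {
    "NEW": {"DATA ANALYST"},
    "RESOLVED": {"DATA VALIDATOR"},
}


def can_access(role, state):
    """Return True if the role was granted access to tasks in this state."""
    return role in STATE_ROLES.get(state, set())
```

For example, `can_access("DATA ANALYST", "NEW")` is true, while the same role has no access to tasks in the Resolved state.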

8. COMPLETE CAMPAIGN CREATION


Save the campaign.

a. To save the campaign, in the lower right corner, click the ADD CAMPAIGN button.

b. The campaign appears on the list of campaigns.

c. To display the Edit the campaign and Remove the campaign icons, hover on the top-right corner of the LIBRARY
PRODUCTS RESOLUTION section.
Be careful when using the Remove the campaign icon because the deletion of a campaign is irreversible.

Next step
You have almost finished this section. Time for a quick review.

Review
In this lesson, you learned how to create a data model in Talend Data Stewardship. You also learned how to add new semantic
types.
Then you created a campaign by completing these tasks:
Choosing the type of campaign
Creating roles for the campaign
Defining the data model for the data to be corrected and assigning permissions to fields for previously defined roles
Creating workflows for the campaign tasks

Additional information
Talend knowledge base article documentation:
Managing campaigns

LESSON 4
Creating and Handling Tasks
This chapter discusses:

Concepts
Overview
Creating Tasks
Assigning and Resolving Tasks
Review
Concepts

Overview
In the previous lesson, you acted as the campaign owner and created a data model and campaign in the Talend Data Stewardship
Web UI.
To create tasks for this campaign, you will use the Talend Data Integration (DI) suite. Specific DI components allow you to create,
delete, and read tasks from a campaign in Talend Studio (the campaign owner role is required).
After creating a couple of tasks in Data Stewardship, you will act as the data steward to resolve them.
You will see how data stewards can delegate tasks among each other to split work, as well as resolve and validate tasks.

Objectives
After completing this lesson, you will be able to:
Use Talend Studio to create tasks for a campaign
Use Talend Studio to delete tasks from a campaign
Delegate tasks
Resolve and validate tasks

Next step
You are ready to use Data Integration to create tasks for a campaign.

Creating Tasks

Overview

In this section, you will use Talend Studio to create tasks for the resolution campaign you created.
You will create a new Studio project, create a Job that reads a data file exported from SAP, and create a resolution task
for each record in that file.

This diagram shows how it works:

You will also learn how to delete all campaign tasks from a Job.

Create a project in Talend Studio


1. START TALEND STUDIO
On the Windows taskbar, click the Studio icon.

2. CREATE A NEW PROJECT


Create a project called TDS.
a. Select the Create a new project radio button. In the text box, enter TDS.

b. Click Create.
The project appears on the list of existing projects.

c. Click Finish.
The TalendForge page may appear.
Click Skip this step.
The Integration perspective (which you may remember from the Talend Data Integration Basics course) opens.

Create metadata
1. CREATE FILE DELIMITED METADATA

In the Repository, expand Metadata.
Right-click File delimited and select Create file delimited.

2. CONFIGURE FILE DELIMITED METADATA


Configure metadata for the library products file exported from SAP.

a. In the Name text box, enter Library_Products.

b. Click Next.
In the C:/StudentFiles/TDS folder, select the Library_data.csv file.

For the Format, select WINDOWS.

c. Click Next.
Select the Set heading row as column names check box and click Refresh Preview.

d. Click Next.
Change the Name to Products_metadata.

For all the fields, change the data type to String and leave Length empty.

e. Click Finish.
The metadata appears in the repository in Metadata>File delimited.
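What this metadata describes (a delimited file whose first row supplies the column names and whose fields are all read as strings) can be sketched outside Studio as follows. The semicolon separator, the column names, and the sample rows are assumptions for illustration; they are not taken from Library_data.csv.

```python
import csv
import io

# Made-up sample standing in for the exported file; the real columns may differ.
SAMPLE = "Code;Name;Author\n11;Hamlet;William Shakespeare\n12;;\n"


def read_rows(text, delimiter=";"):
    """Use the heading row as column names and keep every value as a string."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


rows = read_rows(SAMPLE)
```

Each row becomes a dictionary keyed by column name, and empty fields are kept as empty strings rather than dropped, which is why leaving Length empty and typing every field as String is a safe starting point for data that still needs curation.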

3. CREATE DATA STEWARDSHIP METADATA

In the Repository, expand Metadata.
Right-click Data Stewardship and select Create Data Stewardship.

4. CONFIGURE DATA STEWARDSHIP METADATA


Define connection details for Data Stewardship.
a. Enter these parameters:
Name: TDS_Connection
URL: http://localhost:19999/data-stewardship/
Username: towner@company.com
Password: talend

b. Click the Test Connection button and confirm that you get the Successful connection message.

c. Click OK and Finish.
The metadata appears in the repository in Metadata>Data Stewardship.

Create tasks for a campaign


Create a Job to read data from a file and generate a resolution task for each record.
1. CREATE A STANDARD JOB
Create a standard Job called CreateResolutionTasks.
a. In the Repository, expand Job Designs.
Right-click Standard and select Create Standard Job.

b. The New Job window opens.


In the Name text box, enter CreateResolutionTasks
In the Purpose text box, enter Create curation tasks for library items

c. Click Finish.
The Job opens in the work area.

2. ADD AND CONNECT COMPONENTS


Add three components: input, mapping, and output.
a. In the Repository, expand the Metadata folder.
Recall that the File delimited metadata describes the structure of the products data.
Drag and drop the Library_Products metadata into the work area. On the list of components, select tFileInputDelimited.

b. Click OK.
The component is added to the Job.

c. Add a tMap component and connect it to the input component using a Main row.

d. To the right of the tMap component, add a tDataStewardshipTaskOutput component.

e. Connect the mapping component to the output component using a createTasks row.

f. Click OK.

3. SAVE THE JOB


To save the current status of the Job, press Ctrl+S.
4. CONFIGURE THE OUTPUT COMPONENT

Configure the component that creates a resolution task for each input record.
a. To open the Component tab, double-click the tDataStewardshipTaskOutput component.

b. For Property Type, select Repository.


To select metadata, click the ellipsis [...] button.

c. For the data stewardship metadata, select TDS_Connection and click OK.
The URL, Username, and Password boxes are filled in.

d. Click the Find a campaign button.
Select Library Products Resolution and click OK.

e. The Campaign, Label, and Type boxes are filled in.

f. Make sure the Override enforcement of data model check box is selected.
Assume that some of the input data does not respect all constraints from the defined data model. If this parameter is not
selected, tasks for those records are not created because they are in error. You want to create tasks for all records
and let data stewards resolve anomalies from the input dataset.

g. For the Assignee, select jclark@company.com.
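The behavior described for the Override enforcement of data model option in step f can be illustrated with a simplified sketch. It is not how the component is implemented; the validation rule, the CATEGORY identifier, and the function names are assumptions, and only two allowed category values are listed.

```python
# Simplified, hypothetical stand-in for the data model constraints.
ALLOWED_CATEGORIES = {"Books", "Video Games"}  # assumed subset of allowed values


def violations(record):
    """Return the attributes that break the (simplified) data model."""
    problems = []
    if record.get("CATEGORY") not in ALLOWED_CATEGORIES:
        problems.append("CATEGORY")
    return problems


def create_tasks(records, assignee, override_enforcement):
    """Create one task per record; skip invalid records unless overridden."""
    tasks = []
    for record in records:
        if violations(record) and not override_enforcement:
            continue  # the record is in error, so no task is created
        tasks.append({"record": record, "assignee": assignee, "state": "NEW"})
    return tasks
```

With the override enabled, a record whose category is Game still becomes a task assigned to the chosen steward, who can then correct it in the Web UI.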

5. COPY SCHEMA FROM OUTPUT TO MAPPING


The schema of the output component was filled in when a campaign was associated with it.
a. On the Component tab of the tDataStewardshipTaskOutput component, click the ellipsis [...] button next to Edit
Schema.

b. The schema was populated for the output component only.

c. To populate the createTasks schema with the same fields, click the double left arrow button.
When asked if you want to transfer all columns to the input schema, click OK.

d. To close the schema editor, click OK.


e. To save the configuration, press Ctrl+S.
6. CONFIGURE MAPPING
Define mapping between the input file and data model extracted from the campaign definition.

a. To open the mapping editor, double-click the tMap component.

b. To automatically map most of the fields, click Auto map!.

c. Manually map the following fields:
Author>AUTHOR_DIRECTOR
Type>SUBCATEGORY
Ranking>POPULARITY

d. Click OK.
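The mapping configured in this step can be summarized with a short sketch. It assumes that automatic mapping pairs input and output columns whose names match regardless of case; the function name and the sample target list are hypothetical, and only the three manual pairs come from the exercise.

```python
# Manual mappings from the exercise: input column -> data model attribute.
MANUAL_MAPPING = {
    "Author": "AUTHOR_DIRECTOR",
    "Type": "SUBCATEGORY",
    "Ranking": "POPULARITY",
}


def map_row(row, target_fields):
    """Map one input row to the campaign's data model attributes."""
    mapped = {field: None for field in target_fields}
    for source, value in row.items():
        # Assumption: unmapped columns are matched by name, ignoring case.
        target = MANUAL_MAPPING.get(source, source.upper())
        if target in mapped:
            mapped[target] = value
    return mapped
```

Input columns with no counterpart in the data model are ignored, and attributes with no source column stay empty.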
7. SAVE THE JOB
To save the Job, press Ctrl+S.
8. TEST THE JOB
To create resolution tasks, execute the Job.

a. In the Run view, click the Run button.

b. Confirm that 1,698 rows were computed.

Exploring generated tasks


1. CONNECT TO DATA STEWARDSHIP
Connect as the business owner.
a. Enter these credentials:
Username: bowner@company.com
Password: talend

b. Click Log In.
Click Campaigns.
Confirm that 1,698 tasks were created and zero are unassigned.

2. OPEN GENERATED TASKS


Recall that all tasks were assigned to data steward James Clark.
a. Click LIBRARY PRODUCTS RESOLUTION.

b. To see the created tasks, click the Assign tasks button.


By default, you see the unassigned tasks in the New state.

There are no unassigned tasks because they were all assigned to James Clark.

c. To see all the tasks, click jclark@company.com.

While configuring the Job that creates tasks for a campaign, you may make an error (for example, you may not correctly designate
the assignee), and you will only notice this when viewing generated tasks.
This is why it is important to be able to delete all tasks and re-create them, if needed.
In the next section, you will create a Job that deletes all tasks, and then, to re-create them, you will again execute the
CreateResolutionTasks Job.

Delete tasks from a campaign


Create a Job that deletes all tasks from a campaign.
1. CREATE A STANDARD JOB
Create a standard Job.
a. In the Integration perspective of Studio, in the Repository, expand Job Designs.
Right-click Standard and select Create Standard Job.

b. The New Job window opens.
In the Name text box, enter DeleteTasks
In the Purpose text box, enter Delete curation tasks for library items

c. Click Finish.
The Job opens in the work area.

2. ADD AND CONNECT COMPONENTS


Add two components: one that deletes tasks and one that displays deleted tasks.
a. Add a tDataStewardshipTaskDelete component.

b. Add a tLogRow component and connect it to the input component using a Main row.

3. SAVE THE JOB


To save the Job, press Ctrl+S.

4. CONFIGURE THE INPUT COMPONENT
Configure the tDataStewardshipTaskDelete component to delete all previously created resolution tasks.
a. To open the Component tab, double-click the tDataStewardshipTaskDelete component.

b. For Property Type, select Repository.


To select metadata, click the ellipsis [...] button.

c. Select datastewardship:TDS_Connection and click OK.


The URL, Username, and Password boxes are filled in.

d. Click the Find a campaign button.
Select LIBRARY PRODUCTS RESOLUTION and click OK.

e. When asked if you want to propagate changes, click Yes.

f. The Campaign, Label and Type boxes are filled in.

5. RUN THE JOB

In the Run view, click the Run button.
After executing this Job, all tasks should disappear from the Library Products Resolution campaign.

6. CONNECT TO DATA STEWARDSHIP


Connect as the business owner.
a. Enter these credentials:
Username: bowner@company.com
Password: talend

b. Click Log In.
The home page opens. Click Campaigns.
Confirm that all tasks were deleted; there are none left for this campaign.

7. RE-CREATE TASKS
In Talend Studio, open the CreateResolutionTasks Job.

a. Run the Job.

b. Confirm that, as before, 1,698 rows were computed (you did not alter this Job).

Next step
Now you will act as data steward and learn how to assign and resolve resolution tasks using the Data Stewardship UI.

Assigning and Resolving Tasks

Overview

Until now, you played the role of campaign owner. You created a data model, a resolution campaign, and tasks for that
campaign.

In this section, acting as a data steward, you will delegate tasks to another data steward and resolve your assigned tasks.
To resolve your tasks, you will analyze inconsistencies and use available functions to simplify your work and correct data
from several tasks at once.

When you realize task definitions are not correct or complete, you will update a constraint defined in the data model and
the definition of a semantic type in the dictionary service.
Finally, in order to understand the workflow configured when you defined the campaign, you will validate the corrected
tasks and follow task evolution from one state to another.

Delegate tasks and explore access rights


1. CONNECT TO TALEND DATA STEWARDSHIP
Connect as data steward James Clark.
a. Enter these credentials:
Username: jclark@company.com
Password: talend

Click Log In.


The Tasks page appears (remember, this is the only menu available for data stewards). Confirm that James has 1,698
assigned tasks.

2. OPEN THE TASKS


Click the LIBRARY PRODUCTS RESOLUTION campaign.

3. REORDER TASKS

Sort the tasks by clicking the arrows above the columns.
In the CODE column, make sure the task list is sorted in ascending order.

4. DELEGATE TASKS
Delegate the first 10 tasks to Dave Smith.
a. To open the delegate tasks menu, click the arrow on the left.
Confirm that Dave Smith has no assigned tasks.

b. On the task list, for the first task, click the Code attribute.
To select the first 10 tasks, press and hold the SHIFT key, then select the tenth task.

c. On the menu to the left, hover your mouse over dsmith@company.com to bring up a new icon.

d. Click the blue person icon.


Confirm that 10 tasks are assigned to Dave Smith and fewer are assigned to James Clark.

NOTE:
You can also drag and drop tasks to assign them to a data steward.

5. VERIFY DATA STEWARD ACCESS RIGHTS


Recall that in the previous chapter, when defining the resolution campaign, some restrictions were defined for data
stewards.
Data stewards should have read-only access to the Code column, and the Price column should be hidden.
a. On the list of tasks, click the first one.
Try to modify the value for CODE. The field is not editable.

b. To the right of the Format column, verify that the Price column does not appear (for data stewards, it is hidden in the
campaign definition).

Handle dataset inconsistencies


1. UNDERSTAND HOW THE QUALITY BAR WORKS
Look at the top of the Category column. Just below the column type, notice the three-color bar that displays the rows with
correct data, empty fields, and incorrect data.
Green: Matches the data type and conforms to the validation rules
White: Empty
Red: Data does not match the cell format
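The three segments of the quality bar can be understood as simple counts over a column. The sketch below is a conceptual illustration, not the product's implementation; the function name and the validity check passed to it are assumptions.

```python
from collections import Counter


def quality_bar(values, is_valid):
    """Count valid (green), empty (white), and invalid (red) cells in a column."""
    counts = Counter({"valid": 0, "empty": 0, "invalid": 0})
    for value in values:
        if value is None or str(value).strip() == "":
            counts["empty"] += 1
        elif is_valid(value):
            counts["valid"] += 1
        else:
            counts["invalid"] += 1
    return dict(counts)
```

Clicking the red area in the Web UI corresponds to filtering the task list down to the cells counted as invalid here.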

2. DISPLAY ONLY INVALID TASKS
In the Category column, click the red area to select only tasks with invalid values for the category.
The filter appears at the top of the task list, and there are 83 tasks with an invalid value.

3. MANUALLY CORRECT INVALID VALUES


Correct invalid values in the Category column.
a. To see the list of values allowed for this column, double-click the first invalid value in the column. This corresponds to
the list of allowed values defined in the data model.

On the list, select Books.
b. The list is updated from 83 to 82 tasks because the modified task no longer contains an invalid value for the
Category attribute.

4. CORRECT INVALID VALUES IN BULK


As you can see on the task list, several tasks have the same invalid value for the category (for instance, Game instead of
Video Games).
To correct them all at once, you can use a cleansing function from the menu on the right.

a. Select the Category column and view the list of values in the Chart section on the lower right side of the window.

b. On the upper right, on the list of available functions, locate the replace function.

c. Select Replace the Cells that Match and configure the function like this:
Initial value: Game
Replacement value: Video Games

d. Click SUBMIT.
All the Game values are replaced with Video Games, and only two tasks with invalid values remain in the Category
column.
In a real situation you would, of course, finish fixing all the records.
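Conceptually, the Replace the Cells that Match function applies one exact-match substitution to a column across all displayed tasks. A minimal sketch, with hypothetical names and tasks represented as plain dictionaries:

```python
def replace_matching(tasks, column, initial, replacement):
    """Replace cells in `column` that equal `initial`; return how many changed."""
    changed = 0
    for task in tasks:
        if task.get(column) == initial:
            task[column] = replacement
            changed += 1
    return changed
```

Calling `replace_matching(tasks, "CATEGORY", "Game", "Video Games")` leaves other invalid values untouched, which is why a few invalid tasks remain after the bulk correction.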

5. REMOVE THE FILTER

To remove the filter, click the cross icon (X).
This displays the complete task list.

6. CORRECT INVALID VALUES FOR THE YEAR


Filter tasks with invalid year values, correct one of them, and confirm that the value changes to valid.
a. To select tasks with an invalid year value, select the YEAR column and click the red area.

b. Recall that the YEAR column is a semantic type that should be either a year or an interval of years.
Update the value for the task with Code 153 from 13th century to 1201-1300.

c. To process the change, press ENTER.


Confirm that the value is no longer on the list of invalid values for the YEAR column.

d. To go back to the complete list of tasks, remove the filter. (In a real situation, you would finish fixing the records.)
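The rule that a year value must be either a year or an interval of years can be sketched as a pattern check. The exact pattern behind the semantic type is not shown in this section, so the four-digit regular expression below is an assumption.

```python
import re

# Assumed pattern: a four-digit year, optionally followed by "-" and a second year.
YEAR_OR_INTERVAL = re.compile(r"\d{4}(-\d{4})?")


def is_publish_year(value):
    """Return True for values such as 1999 or 1201-1300."""
    return bool(YEAR_OR_INTERVAL.fullmatch(value.strip()))
```

Under this assumption, 13th century is rejected while 1201-1300 is accepted, which matches the correction made for the task with Code 153.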

Update the data dictionary and verify the impact


You will now update the definition of an existing semantic category and verify that the quality bar is subsequently updated.
Recall that when defining the data model, you chose the language semantic category as the data type for the Language attribute.
After viewing the data, you would like to add more allowed values to the list of predefined languages allowed for this semantic
category.
To do this, you will use the command line tool from the dictionary service (in Chapter 2, you used this to create a new semantic
category).

1. VERIFY INVALID VALUES FOR THE LANGUAGE ATTRIBUTE


Explore invalid values for the Language attribute.

a. In the Web UI, select the LANGUAGE column and click the red area.

b. You want to add Akkadian and Old Norse to the list of allowed values for the language semantic type.
2. OPEN THE LANGUAGE SEMANTIC TYPE
a. Log out and log back in as the campaign owner with these credentials:
Username: bowner@company.com
Password: talend

b. From the left menu, select Semantic Types.

c. The semantic types page appears. Click the magnifier icon.

d. Filter the semantic types by entering language.

e. Click the Language title of the first result.

3. ADD THE NEW SEMANTIC TYPE VALUES


Add Akkadian and Old Norse items in the Language semantic type.
a. Click the [ + ] icon.

b. In the New entry text box, enter Akkadian, then validate by clicking the first tick icon (you can also press the Enter key).

c. Repeat the operation for Old Norse.

d. Click the cross icon.

e. Click the magnifier icon and enter the names of the items you added. They should appear in the Values list.

f. At the bottom of the screen, click the Save and Publish button.

4. VERIFY THE SEMANTIC CATEGORY MODIFICATION IMPACT


After the semantic category update, the new values should be allowed for the Language attribute in the Resolution
campaign.
a. Log out and log back in as James Clark with these credentials:
Username: jclark@company.com
Password: talend
b. To open the tasks, click the LIBRARY PRODUCTS RESOLUTION campaign.

c. To select tasks with an invalid value for the language, select the Language column and click the red area.

d. Notice that the newly added entries, Akkadian and Old Norse, are now allowed for the Language semantic type (they
are no longer on the list of invalid values). In a real situation, you would finish fixing all the invalid values.
e. Remove the filter.
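The effect of publishing the updated semantic type can be sketched as re-checking a column against a larger set of allowed values. The initial value set and the sample column below are made up for illustration; the real Language dictionary contains many more entries.

```python
def invalid_values(column, allowed):
    """Return the distinct non-empty values that are not in the allowed set."""
    return sorted({value for value in column if value and value not in allowed})


# Made-up starting dictionary and column content.
language_values = {"English", "French"}
language_column = ["English", "Akkadian", "Old Norse", "Englsh", ""]

before = invalid_values(language_column, language_values)
language_values.update({"Akkadian", "Old Norse"})  # mirrors Save and Publish
after = invalid_values(language_column, language_values)
```

Only the values that are still absent from the dictionary, such as a misspelling, remain invalid after the update.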

Validate tasks
1. MARK A TASK AS READY FOR VALIDATION
Mark the first task as ready for validation.
a. Select the first task (code 11).
If the lock icon to the left of the row is white on a gray background, the task has no inconsistencies related to the
defined data model.

b. Click the lock icon.

The task turns green and is marked ready for validation, and the next task is automatically selected.

2. RESOLVE AN INVALID TASK AND MARK IT READY FOR VALIDATION


Resolve a task with an invalid language value.
a. Select the task with code 17, which should have an invalid language value.

b. The lock icon is now on a red background. This means the task has inconsistencies related to the defined data model,
so it cannot be marked as ready for validation.
Update the language value to a valid one: English.

c. Now that you have resolved the inconsistency, the lock is white and the task can be marked as ready for validation.
To mark the task as ready for validation, click the lock icon.

NOTE:
The objective of a resolution task is to ensure data integrity and validity. Therefore, you cannot validate a task that has
an invalid value.

3. DELETE A TASK
In your dataset, you have a record with empty name and author fields. Because this record is so low in quality, the data
steward decides to delete it.
a. To select the task with the invalid name, click the red area in the NAME column.
Notice that the name and author fields are empty.

b. On the functions menu on the right, select the ROW tab.

c. Click Remove tasks.

d. Click Submit.
The task is deleted (no task has an invalid value for the NAME column, and the total number of tasks decreases by
one).

e. Remove the filter.


The quality bar for the NAME column is entirely green.

4. VALIDATE YOUR CHOICES


You marked two tasks as ready for validation, but they have not been validated.
a. In the upper right corner, click the Validate Choices button.

b. Notice that the number of tasks for James Clark has decreased from 1687 to 1685 (two tasks were validated).
This means the tasks moved from the New to the Resolved state.
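The transition triggered by Validate Choices can be pictured with a small sketch. The task fields (`state`, `ready`, `invalid_fields`) and the function name are hypothetical and only model the behavior observed in this exercise.

```python
def validate_choices(tasks):
    """Move tasks marked ready (and free of invalid values) from NEW to RESOLVED."""
    moved = 0
    for task in tasks:
        is_ready = task.get("ready", False)
        has_errors = bool(task.get("invalid_fields"))
        if task["state"] == "NEW" and is_ready and not has_errors:
            task["state"] = "RESOLVED"
            task["ready"] = False
            moved += 1
    return moved
```

Tasks that were not marked with the lock icon stay in the New state and remain in the data steward's list.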

5. DISPLAY THE RESOLVED TASKS
Recall from the campaign definition that the data stewards can access tasks in the New state, while the business owner has
access to tasks in the Resolved state.
a. Log out, then log back in as the business owner with these credentials:
Username: bowner@company.com
Password: talend
b. Click the campaign from the Tasks menu.
By default, the resolved and unassigned tasks are displayed.

You can see the two tasks resolved by James Clark.

Next step
You have almost finished this section. Time for a quick review.

Review
In this lesson, you created a Talend Studio project and learned how to build a Job that creates tasks for a campaign in Talend Data
Stewardship. To do this, you configured the tDataStewardshipTaskOutput component.

You created a second Job to delete all tasks from a campaign (in case you make a mistake and want to re-create all tasks with
different parameters). To delete tasks from a campaign in Data Stewardship, you configured the tDataStewardshipTaskDelete
component.
After creating a set of resolution tasks for a dataset from SAP, you connected to the Data Stewardship Web UI and explored the
generated tasks.
You learned how to handle tasks and delegate them to data stewards.
You saw that you have a couple of data curation options, which can be useful when trying to resolve data inconsistencies.
Finally, you resolved some tasks, validated them, and verified that their states changed according to the two-state workflow defined
for the resolution campaign.

LESSON 5
Grouping Campaign
This chapter discusses:

Concepts
Overview
Creating a Data Model
Creating a Grouping Campaign
Creating Grouping Tasks
Handling Grouping Tasks
Review
Concepts

Overview
In the previous lesson, you created a resolution campaign, generated tasks for it by using a library products dataset exported from
SAP, and saw how data stewards can resolve and validate resolution tasks.
Recall that when creating a campaign in Talend Data Stewardship, several campaign types are available: resolution, grouping,
merging, and arbitration.
In this lesson, you will work on a grouping campaign.
The management decided to work on the inventory of the disc library.
Some records are missing information. Others are duplicates or exist in various formats.
The purpose of the grouping campaign is to confirm that submitted records are indeed linked to the same album and artist.
After creating the campaign, you generate tasks from a CSV file using Studio. As a campaign owner, you dispatch the tasks. Finally,
as a data steward, you resolve the tasks.

Objectives
After completing this lesson, you will be able to:
Create a new data model
Define a grouping campaign
Build a Job that creates tasks for the campaign
Bulk assign tasks
Modify task service level agreements (SLA)
Handle tasks

Next step
You are ready to create a new data model.

Creating a Data Model

Overview

You are ready to work on a data model that stores CDs and LPs with details on their format, release year, and information
source.
In this section, you create an associated data model to prepare for the next exercise.

Create a data model


1. CONNECT TO THE DATA STEWARDSHIP WEB UI
Connect as a business owner by entering these credentials:
Email: bowner@company.com
Password: talend
Click SIGN IN.
The home page appears.

2. ADD A DATA MODEL


Add a data model for library items.
a. On the menu, click DATA MODELS.

b. Click the ADD DATA MODEL button.

c. The data model form appears.

3. DEFINE THE DATA MODEL


In the NAME text box, enter Record Inventory, and in the DESCRIPTION text box, enter a short description.

4. DEFINE THE FIRST ATTRIBUTE


By default, below the description, the data model contains an attribute.
a. Designate the Identifier as CODE and the Name as Code.
b. Change the data type to Integer.
c. Deactivate the ALLOW EMPTY VALUES slide button to make the identifier mandatory.

5. ADD AN OPTIONAL ITEM ALBUM ATTRIBUTE
To add a second attribute, on the menu on the left, click the ADD ATTRIBUTE button.
The attribute details form appears in the main frame.
Define it like this:
Identifier: ALBUM
Name: Album
Data type: Text
Make the field optional by leaving the ALLOW EMPTY VALUES slide button activated.

6. ADD AN OPTIONAL ARTIST ATTRIBUTE
Using the same method, add an attribute used for storing the name of the album artist.
Define it like this:
Identifier: ARTIST
Name: Artist
Data type: Text
Leave the ALLOW EMPTY VALUES slide button activated.

7. ADD AN OPTIONAL YEAR ATTRIBUTE


Add an attribute for storing the release year of the record item.
Define it like this:
Identifier: YEAR
Name: Year
Data type: Text
Leave the ALLOW EMPTY VALUES slide button activated.

8. ADD AN OPTIONAL ITEM GENRE ATTRIBUTE
Add an attribute for storing the genre of the record item.
Define it like this:
Identifier: GENRE
Name: Genre
Data type: Text
Leave the ALLOW EMPTY VALUES slide button activated.

9. ADD AN OPTIONAL ITEM FORMAT ATTRIBUTE
Add an attribute for storing the record item format (for example, CD or vinyl).
Define it like this:
Identifier: FORMAT
Name: Format
Data type: Text
Leave the ALLOW EMPTY VALUES slide button activated.

10. VERIFY THE DATA MODEL STRUCTURE
Confirm that the data model looks like this:

11. SAVE THE DATA MODEL


Save the data model.

a. In the lower right corner, click the ADD DATA MODEL button.

b. The data model appears on the list of available data models.

Next step
You are ready to create a campaign based on the defined data model.

Creating a Grouping Campaign

Overview

In this lesson, you create a grouping campaign that enables you to decide which records should be gathered in a common
group.

Create a grouping campaign


1. ADD A CAMPAIGN
Create a new campaign.
a. Log in as the business campaign owner.

b. On the Campaigns page, click the Add Campaign button.


The campaign form appears.

2. DEFINE THE GENERAL CHARACTERISTICS


Specify the campaign name, description, and type.
a. Enter these parameters:
Name: LIBRARY PRODUCTS GROUPING
Description: Group records from a data set



b. For TYPE, select GROUPING.
The form is updated.

3. DEFINE THE GROUPING CAMPAIGN QUESTION


Specify the campaign question and possible answers.

a. Enter these parameters:


Question: Are these items the same album?
1: Yes
2: No
3: Not sure

4. WORK WITH SLA
By default, tasks have an unlimited amount of time in which to be resolved. The campaign owner can define a maximum number of days to resolve tasks. When that time is exceeded, the task appears in red.
Note: By default, this feature uses a simple day count, so weekends and holidays are counted the same as working days. An SLA can also be defined when the tasks are generated by a Studio Job, which makes it possible to apply more advanced calculation rules or external calendars.



a. Activate the Enable task resolution delay slide button.

b. Enter 5 in the Delay text box.
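
The note in this step contrasts the default simple day count with calendar-aware rules. The sketch below illustrates the difference in Python. Both function names are hypothetical, and the working-day rule (Monday to Friday, no holidays) is an assumption chosen for illustration, not the rule a Studio Job would necessarily apply.

```python
from datetime import date, timedelta


def due_date_simple(created: date, delay_days: int) -> date:
    """Simple day count: weekends and holidays count like working days."""
    return created + timedelta(days=delay_days)


def due_date_working_days(created: date, delay_days: int) -> date:
    """Alternative rule (assumption): count only Monday to Friday."""
    current = created
    remaining = delay_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0..4 are Monday..Friday
            remaining -= 1
    return current
```

For a task created on a Thursday with a delay of 5, the simple count lands on the following Tuesday, while the working-day rule lands on the following Thursday.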

5. DEFINE THE CAMPAIGN OWNERS


By default, the campaign owner creating the campaign is assigned to it.
Add the technical campaign owner as you did for the previous campaigns. This role is used for adding tasks to the campaign
using Talend Studio.
a. Click the Add a campaign owner text box.
b. Select towner@company.com.

6. DEFINE ROLES FOR THE GROUPING CAMPAIGN


Define two roles for the grouping campaign: data steward and task validator.
a. Next to STEWARDS, click the Edit icon and change the role name to DATA STEWARD (uppercase). To save, click
the check mark.



b. Click the Add a steward text box and select James Clark and Dave Smith.

c. Click the ADD A ROLE button.


d. Click the Edit icon and change the role name to TASK VALIDATOR (uppercase).

e. To save, click the check mark.

f. Click the Add a steward text box and select the business campaign owner, who will be the task validator.

7. DEFINE THE DATA MODEL FOR GROUPING TASKS


Select the Record Inventory data model and define access rights.
a. Click in the Select a data model text box and select Record Inventory.



b. The form is updated with the data model. Do not restrict access rights for this campaign.

8. DEFINE THE CAMPAIGN WORKFLOW


You must define the steps for resolving the campaign tasks.
a. To grant the data steward access to the campaign tasks in the New state, below NEW, click the Add a role box and
select DATA STEWARD.

b. To grant the task validator access to the campaign tasks in the Resolved state, below RESOLVED, click the Add a
role text box and select TASK VALIDATOR.

9. FINISH CREATING THE CAMPAIGN


Save the campaign.
a. In the lower-right corner, click the ADD CAMPAIGN button.

b. The campaign appears on the list.

Next step
You are ready to learn how to create grouping tasks in Talend Studio.



Creating Grouping Tasks

Overview

In this exercise, you create a Job that reads an aggregated file from SAP and the legacy product catalog and creates a
grouping task for each record in that file.
This diagram shows how it works:

Create metadata
1. CREATE FILE DELIMITED METADATA
a. In the Repository, expand Metadata.
b. Right-click File delimited and select Create file delimited.

2. CONFIGURE FILE DELIMITED METADATA


Configure metadata for the library products file exported from SAP.

a. In the Name text box, enter Album_Catalog and enter a purpose and description.

b. Click Next.
c. In the C:/StudentFiles/TDS folder, select the albums.csv file.



d. For Format, select WINDOWS.

e. Click Next, select the Set heading row as column names check box, and click Refresh Preview.

f. Click Next.
Change Name to Albums_metadata.



For all the fields, leave Length empty. For every column except CODE, set Type to String.

g. Click Finish.
The metadata appears in the repository in Metadata>File delimited.
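
As a rough picture of what the metadata describes, the sketch below parses delimited text in Python, using the heading row as column names and converting only CODE to a number. The read_albums function, the semicolon delimiter, and the sample rows are assumptions made for illustration; the actual columns and separator of albums.csv are whatever the wizard preview shows.

```python
import csv
import io


def read_albums(text: str, delimiter: str = ";") -> list:
    """Parse delimited text, using the heading row as column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for row in reader:
        row["CODE"] = int(row["CODE"])  # CODE is the only non-String column
        rows.append(row)
    return rows


# Hypothetical sample content, not taken from the course file.
sample = "CODE;ALBUM;ARTIST\n1;Kind of Blue;Miles Davis\n2;Blue Train;John Coltrane\n"
albums = read_albums(sample)
```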

Creating tasks for a campaign


Create a Job to read data from a file and generate a resolution task for each record.
1. CREATE A STANDARD JOB
Create a standard Job called CreateGroupingTasks.
a. In the Repository, expand Job Designs.
Right-click Standard and select Create Standard Job.

b. The New Job window opens.


In the Name text box, enter CreateGroupingTasks
In the Purpose text box, enter Prepare grouping tasks for library records
Enter a comment of your choice in the Description text box.



c. Click Finish.
The Job opens in the work area.

2. ADD AND CONNECT COMPONENTS


Add three components: input, mapping, and output.

a. In the Repository, expand the Metadata folder.

Drag the Album_Catalog metadata into the work area. On the list of components, select tFileInputDelimited.

b. Click OK.
The component is added to the Job.



c. Add a tMap component and connect it to the input component using a Main row.

d. Drag the TDS_Connection metadata to the right of the tMap component in the work area. Then select tDataStewardshipTaskOutput.

e. Click OK.

f. To open the Component tab, double-click the tDataStewardshipTaskOutput component.

g. Click the Find a campaign button, select LIBRARY PRODUCTS GROUPING, and click OK.



h. Click the Advanced settings tab.

i. Set the drop-down menus as in the following screenshot:

j. Connect tMap to TDS_Connection using a Main row and name it grouping_tasks.

k. Click OK.

l. Click Yes when asked if you want to get the schema of the target component.

The Job looks like this:

3. SAVE THE JOB


To save the current status of the Job, press Ctrl+S.
4. CONFIGURE MAPPING
Define the mapping between the input file and data model extracted from the campaign definition.



a. To open the mapping editor, double-click the tMap component.

b. To automatically map most of the fields, click Auto map!.

c. Manually map the following fields:
GID>TDS_GID



Source>TDS_SOURCE

d. Click OK.
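
The mapping configured in this step can be pictured as a simple row transformation. The Python sketch below is an illustration under assumptions: map_row and MODEL_COLUMNS are hypothetical names, and the auto-map behavior is approximated as copying every input column whose name matches an output column.

```python
# Output columns taken from the Record Inventory data model.
MODEL_COLUMNS = ["CODE", "ALBUM", "ARTIST", "YEAR", "GENRE", "FORMAT"]


def map_row(row: dict) -> dict:
    """Build one output row for the task component from one input row."""
    # Auto map: copy every input column whose name matches an output column.
    task = {col: row[col] for col in MODEL_COLUMNS if col in row}
    # Manual mappings from step c.
    task["TDS_GID"] = row["GID"]
    task["TDS_SOURCE"] = row["Source"]
    return task
```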
5. SAVE THE JOB
To save the Job, press Ctrl+S.
6. TEST THE JOB
To create resolution tasks, execute the Job.

a. In the Run view, click the Run button.

b. Confirm that 227 rows were computed.

Next step
As a campaign owner, you are ready to assign tasks, act as a data steward, and handle grouping tasks.



Handling Grouping Tasks

Overview

You created a grouping campaign and generated tasks for it. Now you can act as a campaign owner, dispatching tasks to
data stewards. Once that is done, you will also act as a data steward and handle some arbitration tasks. Unlike in the
resolution campaign, handling tasks in a grouping campaign does not consist of correcting records but of answering a
question about a group of records.

Handle grouping tasks


1. CONNECT TO TALEND DATA STEWARDSHIP
If you have not done so already, log in as the campaign owner and go to Campaigns.
a. Enter these credentials:
Username: bowner@company.com
Password: talend
b. Click Sign In.
The Tasks page appears.
2. OPEN THE CAMPAIGN
To display the campaign list, click Campaigns.
Verify that 102 tasks are unassigned.

3. DISPLAY THE UNASSIGNED TASKS

To display the tasks you created earlier, click LIBRARY PRODUCTS GROUPING.

4. DISPLAY THE DUE DATE


When you created the grouping campaign, you defined a task resolution delay of five days.
You can check the actual due date directly from the list.



Scroll right to display the Due date column. The date format is YYYY-MM-DD hh:mm:ss.

5. CHANGE THE DUE DATE


There was a plan change and half of the tasks must be completed in only two days.

a. Select the first half of the tasks: click the first one, press and hold SHIFT, and select task 51.

b. Confirm that you selected 51 items.


On the Arbitration menu, click Row.



c. Click the [+] to the right of Define due date.

d. Update the Due date text box by adding two days to the current date, then click the SUBMIT button.

e. Verify that the due date has been updated in the DUE DATE column of the list for only the first 51 tasks.

6. BULK ASSIGN TASKS


Dave Smith will handle all tasks on the current list.
You can assign them without manually selecting all the rows.



a. In the menu on the right, click the [-] to collapse the Define due date menu.

b. Click the [+] to the right of Assign tasks to expand the menu.

c. Select All tasks on the Selection drop-down menu and click the SUBMIT button.

The unassigned tasks list is empty.

7. DISPLAY THE NEW TASKS


Acting as Dave Smith, display the new tasks.
a. At the top of the screen, click bowner@company.com, then click Logout.

b. Log in with the following credentials:


Login name: dsmith@company.com



Password: talend

c. From the homepage, click LIBRARY PRODUCTS GROUPING.


The new tasks appear on a list.

8. MARK A TASK AS READY FOR VALIDATION



By default, the first task is selected. In answer to the question Are these items the same album?, click No.

The task is green and ready for validation, and the choice is visible from the Arbitration column.

9. BULK RESOLVE TASKS USING THE MENU ON THE RIGHT



a. Select four unresolved tasks: click the first one, press and hold SHIFT, and select the last one.

b. Click the Row button.

c. Click the [+] to the right of Arbitrate tasks to expand the menu.

d. Select Not sure on the Arbitration drop-down menu and click the SUBMIT button.

10. VALIDATE THE GROUPING TASKS



Click the Validate choice button.

Next step
You have almost finished this section. Time for a quick review.

Review
In this lesson, you created a grouping campaign and learned how to handle a scenario in which data stewards must validate the
correlation of task records by answering a question.
You learned how to:
Create a grouping campaign
Create and assign grouping tasks in Talend Studio
Define task SLA
Bulk assign tasks to data stewards
Resolve grouping tasks in Data Stewardship

Additional information
Talend knowledge base:
Adding a grouping campaign
Handling grouping tasks



LESSON 6
Merging Campaign
This chapter discusses:

Concepts
Overview
Creating a Merging Campaign
Creating Merging Tasks
Resolving Merging Tasks
Querying Tasks
Review
Concepts

A merging campaign enables you to choose fields for determining a match between records and then merge the records. Once
data is deduplicated, the unique list of product items can be uploaded to any product referential system.



When several records coming from various sources exist for the same item, the matching process uses them to build a
unique record.

In an MDM project, this is often done to identify records from different sources representing the same real-world entity.
This operation leads to a “master record.”
Matching can be deterministic or probabilistic.
Deterministic:
Exact match of common identifiers (for example, national IDs, system IDs)
Exact comparison between other fields (generally not reliable)
Probabilistic:
Several field values are compared between two records and each field is assigned a weight that indicates how closely
the two field values match.
The sum of individual field weights indicates the likelihood of a match between two records.
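
The two approaches can be contrasted with a short Python sketch. It is a simplified illustration, not the scoring used by any Talend component: the function names are hypothetical, difflib's SequenceMatcher stands in for a real matching function, and the score is assumed to be a weighted average of per-field similarities.

```python
from difflib import SequenceMatcher


def deterministic_match(a: dict, b: dict, id_field: str) -> bool:
    """Exact match on a shared identifier; missing identifiers never match."""
    return a.get(id_field) is not None and a.get(id_field) == b.get(id_field)


def probabilistic_score(a: dict, b: dict, weights: dict) -> float:
    """Weighted average of per-field similarities, between 0.0 and 1.0."""
    total_weight = sum(weights.values())
    weighted_sum = 0.0
    for field, weight in weights.items():
        # SequenceMatcher.ratio() is a stand-in similarity measure in [0, 1].
        similarity = SequenceMatcher(None, a[field], b[field]).ratio()
        weighted_sum += weight * similarity
    return weighted_sum / total_weight
```

A deterministic rule answers yes or no, while the probabilistic score has to be compared with a threshold before two records are declared a match.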

Record linkage of large volumes of data is a computationally expensive task. Blocking can be used in record linkage
algorithms to speed up performance by reducing the number of comparisons needed to match records.
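
To see why blocking helps, the Python sketch below counts candidate pairs with and without a blocking key. The candidate_pairs function is a hypothetical helper written for this illustration; it assumes that records are only compared when they share the same value for the blocking key.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records: list, blocking_key: str = None) -> list:
    """Return index pairs to compare, optionally restricted to blocks."""
    if blocking_key is None:
        # No blocking: every record is compared with every other record.
        return list(combinations(range(len(records)), 2))
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[record[blocking_key]].append(index)
    pairs = []
    for indices in blocks.values():
        pairs.extend(combinations(indices, 2))
    return pairs
```

With six records spread evenly over three categories, the number of comparisons drops from 15 to 3.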

Matching functions
Soundex: Phonetic algorithm (English) that indexes strings with a code
For example: “Hello” → “H400”
Advice: Soundex yields many false positives; Double Metaphone is usually a better choice
Metaphone: Phonetic algorithm that encodes strings
Double Metaphone: Newer version of Metaphone
Levenshtein (edit distance): Computes the minimum number of edits (insertions, deletions, substitutions) needed to
transform one string into the other
Jaro: Counts the number of matched characters and transpositions
Jaro-Winkler: Same as Jaro but gives more importance to the beginning of the word
Best suited for short strings such as names of people
Q-grams: Slides a window of length q over the string and compares the number of matching q-grams to the total number
of q-grams
Trigrams (q=3) are currently implemented
For example: “match” → “**m”, “*ma”, “mat”, “atc”, “tch”, “ch*”, “h**”
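
Three of the functions listed above are small enough to sketch in Python. These are textbook versions written for illustration, under the assumption that padding with two asterisks on each side is how the trigram example is produced; they are not the implementations shipped with Talend.

```python
def soundex(word: str) -> str:
    """Classic four-character Soundex code for an English word."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            codes[letter] = digit
    word = word.upper()
    result = word[0]
    previous = codes.get(word[0], "")
    for letter in word[1:]:
        digit = codes.get(letter, "")
        if digit and digit != previous:
            result += digit
        if letter not in "HW":  # H and W do not separate equal codes
            previous = digit
    return (result + "000")[:4]


def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def trigrams(s: str) -> list:
    """Slide a window of length 3 over the string padded with asterisks."""
    padded = "**" + s + "**"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]
```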

Match threshold: If the group probability is greater than or equal to this value, the records match
Confident match threshold: If the group probability is greater than or equal to this value, confidence in the quality of the
matching group is increased

Overview
In the previous lessons, you created a resolution and a grouping campaign.
In this lesson, you will create a merging campaign, generate tasks for it, and resolve and validate the tasks by implementing a
three-state validation workflow.
Imagine that a company needs to merge data from two systems: SAP and the legacy product catalog. The merging campaign lets
you choose fields to determine a match between records, and then merge the records. Once data is deduplicated, the unique list of
product items can be uploaded to any product referential system.
In this lesson, you receive a dataset containing library products from two sources, which contains unique products as well as
duplicates. Some are clearly duplicates, but others are just "suspects" that a data steward needs to confirm as duplicates.
To facilitate extraction of a set of unique library products from this dataset, you will:
Create a merging campaign in Data Stewardship
Create a Job that processes the input dataset, creates a list of unique items, creates a list of definite duplicates, and
generates tasks for suspects
Resolve and validate merging tasks
Build a Job that consumes tasks validated in Data Stewardship in order to retrieve the corresponding golden records
(unique records calculated from the merge of two suspect records)

Objectives
After completing this lesson, you will be able to:
Define a merging campaign
Build a Job that implements record matching and generates tasks for suspects
Resolve and validate merging tasks
Build a Job that consumes resolved tasks

Next step
You are ready to define a merging campaign.

Creating a Merging Campaign

Overview

In this lesson, you will create a merging campaign that enables you to decide which fields to use to determine a match
between suspect records, and then merge records to create a golden record.

Create a merging campaign


1. ADD A CAMPAIGN
Create a new campaign.
a. Log in as the business campaign owner.

b. On the Campaigns page, click the Add Campaign button.


The campaign form appears.

2. DEFINE THE GENERAL CHARACTERISTICS


Specify the campaign name, description, and type.
a. Enter these parameters:
Name: LIBRARY PRODUCTS MERGE
Description: Merge suspect duplicate records from a library items data set



b. For TYPE, select MERGING.
The form is updated.

3. DEFINE THE CAMPAIGN SOURCES


a. To define the merging campaign sources and identify the trust score for each, next to Define sources, click the
switch.
b. Name the first source SAP, with a trust score of 90%.
c. Assume that the SAP source is more trustworthy than the legacy product catalog.
Name the second source LEGACY PRODUCT CATALOG, with a trust score of 70%.

4. DEFINE THE CAMPAIGN OWNERS
By default, the campaign owner creating the campaign is assigned to it.
a. Add a second, technical campaign owner; this role will be used for adding tasks to the campaign using Talend Studio.
b. Click in the Add a campaign owner text box and select towner@company.com.

5. DEFINE ROLES FOR THE MERGING CAMPAIGN


Define two roles for the merging campaign: data steward and task validator.



a. Next to STEWARDS, click the Edit icon and change the role name to DATA STEWARD (uppercase).

b. To save, click the check mark.


c. Click the Add a steward text box and select James Clark and Dave Smith.

d. Add a second role named TASK VALIDATOR (uppercase).

e. Click the Add a steward text box and select the business campaign owner, who will be the task validator.

6. DEFINE THE DATA MODEL FOR MERGING TASKS


Select the Library Products data model and define the survivorship rules for each attribute.

a. Click in the Select a data model text box and select Library Products.

b. The form is updated with the data model.


You can also define access rights for the two roles for this campaign, and you can define the survivorship rules for
each field.
By default, both roles have editing rights on all attributes, and the default survivorship rule is First valid (if no valid
field value is found by the First valid survivorship rule, the golden value is set to null).

c. Do not restrict access rights for this campaign.


7. DEFINE THE SURVIVORSHIP RULES
Set the default survivorship rule to Most common and enter these rules in the respective fields:
CODE: SAP
PRICE: LEGACY PRODUCT CATALOG
POPULARITY: SAP
AVAILABILITY: SAP
SOURCE: Most trusted
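
The rules entered in this step can be read as a recipe for building the golden record. The Python sketch below is a simplified interpretation: the function names are hypothetical, NAME is an assumed attribute used to show the default Most common rule, and the exact behavior of each rule in Data Stewardship (for example, how ties or empty values are handled) may differ.

```python
from collections import Counter

# Trust scores defined for the campaign sources.
TRUST = {"SAP": 0.90, "LEGACY PRODUCT CATALOG": 0.70}


def most_common(records: list, field: str):
    """Default rule: the value that occurs most often among the records."""
    values = [r[field] for r in records if r.get(field) not in (None, "")]
    return Counter(values).most_common(1)[0][0] if values else None


def from_source(records: list, field: str, source: str):
    """Source rule: the value from the first record of the given source."""
    for record in records:
        if record["SOURCE"] == source:
            return record.get(field)
    return None


def most_trusted(records: list, field: str):
    """Most trusted rule: the value from the source with the highest score."""
    best = max(records, key=lambda r: TRUST.get(r["SOURCE"], 0.0))
    return best.get(field)


def golden_record(records: list) -> dict:
    return {
        "CODE": from_source(records, "CODE", "SAP"),
        "PRICE": from_source(records, "PRICE", "LEGACY PRODUCT CATALOG"),
        "POPULARITY": from_source(records, "POPULARITY", "SAP"),
        "AVAILABILITY": from_source(records, "AVAILABILITY", "SAP"),
        "SOURCE": most_trusted(records, "SOURCE"),
        "NAME": most_common(records, "NAME"),  # assumed attribute, default rule
    }
```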

8. DEFINE THE CAMPAIGN WORKFLOW
Now you need to define the steps for resolving the campaign tasks.
a. To add a validation step to the workflow (necessary for resolving campaign tasks), next to Enable validation step,
click the switch.
A step is added to the workflow.

b. To grant the data steward access to the campaign tasks in the New state, below NEW, click in the Add a role box and
select DATA STEWARD.



c. To grant the task validator access to the campaign tasks in the To validate state, below TO VALIDATE, click in the
Add a role box and select TASK VALIDATOR.

d. To grant the task validator access to the campaign tasks in the Resolved state, below RESOLVED, click in the Add a
role box and select TASK VALIDATOR.
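
The resulting three-state workflow can be summarized as a mapping from task state to the role that has access. The Python sketch below is only a mental model: the names are hypothetical, and it assumes that tasks move through the states in the order in which they appear on the form.

```python
# States in the order shown on the campaign form once validation is enabled.
WORKFLOW = ["NEW", "TO VALIDATE", "RESOLVED"]

# Role granted access to the tasks in each state (steps b, c, and d).
ACCESS = {
    "NEW": {"DATA STEWARD"},
    "TO VALIDATE": {"TASK VALIDATOR"},
    "RESOLVED": {"TASK VALIDATOR"},
}


def can_access(role: str, state: str) -> bool:
    """Tell whether a role may work on tasks in the given state."""
    return role in ACCESS[state]


def next_state(state: str):
    """Return the following state, or None for the final state."""
    position = WORKFLOW.index(state)
    return WORKFLOW[position + 1] if position + 1 < len(WORKFLOW) else None
```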

9. FINISH CREATING THE CAMPAIGN


Save the campaign.

a. In the lower right corner, click the ADD CAMPAIGN button.

b. The campaign appears on the list.

Next step
Now you will learn how to create merging tasks in Talend Studio.



Creating Merging Tasks

Overview

In this section, you will use Talend Studio to generate tasks for the merging campaign you created.
You will create a Job that:
Reads a file containing product items exported from SAP and the legacy product catalog
Defines matching rules that will be applied to match records and group the records by similarity
Creates a list of unique items and a list of clear duplicates, then generates tasks only for suspect duplicates
This diagram shows how it works:

Build a Job that creates merging tasks


1. DUPLICATE THE EXISTING JOB
Copy and rename the existing Job as CreateMergingTasks.
a. Open the TDS project from Studio.
b. Right-click the CreateResolutionTasks Job and select Duplicate.

c. In the Input new name box, enter CreateMergingTasks.



d. Click OK.
To open the Job in the design area, double-click it.

2. UPDATE THE INPUT COMPONENT CONFIGURATION


Update the input file name for the tFileInputDelimited component.
a. To open the Component tab, double-click the Library_Products input component.

b. Change the Property Type from Repository to Built-In.

c. Change the file name to C:/StudentFiles/TDS/Library_data_Sources.csv.
This file contains product items from both SAP and the legacy product catalog.

d. To save the updated input component, press Ctrl+S.


3. CONFIGURE THE OUTPUT COMPONENT
Configure the output component to create tasks for the merging campaign instead of the resolution campaign.
a. To open the Component tab, double-click the tDataStewardshipTaskOutput component.

b. Click the Find a campaign button.


Select the "LIBRARY PRODUCTS MERGE" campaign.



c. If it is not selected, select the Override enforcement of data model check box.

d. Click the ellipsis [...] button next to Edit schema. In the schema window, click the double left arrow button to
transfer the schema to the mapping component.
When asked for validation, click OK.

e. To close the window, click OK.


f. Click the Advanced settings tab.
g. Set the six drop-down menus as shown in the following screenshot.

h. To save the updated Job, press Ctrl+S.
4. ADD A MATCHING COMPONENT
To group similar records, you need to add a tMatchGroup component.
a. Between the input and mapping components, delete the Main row.

b. Between the input and mapping components, add a tMatchGroup component.



c. Connect the Library_Products input component to the tMatchGroup component using a Main row.

d. Save the Job by pressing Ctrl+S.


5. SET UP THE MATCHING ALGORITHM
Start configuration of the matching component by selecting the matching algorithm.
a. To open the Component tab, click the tMatchGroup component.

b. Leave the Matching Algorithm designated as Simple VSR.

NOTE:
Two matching algorithms are available:
Simple VSR and T-Swoosh.

T-Swoosh is a sequential entity resolution algorithm that allows you to match and preserve records in a
sequential way. The algorithm calculates a golden record for each group of records by using the survivorship
rules defined in the configuration of this component.
The Simple VSR algorithm takes a set of records as input, and groups similar encountered duplicates according
to defined match rules. It compares pairs of records and assigns them to groups. The first record processed
in each group is the master (no survivorship rules are defined for this algorithm).

At this stage, you want to group only similar records without calculating the golden record. You want to let Data
Stewardship calculate the golden record according to the survivorship rules defined when the merging campaign was
created. As a result, in this exercise you will configure the Simple VSR algorithm.
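
The grouping behavior described in the note for Simple VSR can be approximated in a few lines of Python. This is a deliberately simplified sketch, not Talend's algorithm: group_records is a hypothetical helper that compares each record only with the master (first record) of every existing group and takes the similarity function and threshold as parameters.

```python
def group_records(records: list, similarity, threshold: float) -> list:
    """Group records; the first record placed in a group is its master."""
    groups = []
    for record in records:
        for group in groups:
            master = group[0]
            if similarity(master, record) >= threshold:
                group.append(record)
                break
        else:
            # No existing group is similar enough: start a new group.
            groups.append([record])
    return groups
```

No golden record is computed here. Each group simply keeps its first record as the master, which is why the merge is left to Data Stewardship in this exercise.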

6. CONFIGURE MATCHING RULES


Define matching rules for grouping similar records.
a. To open the Configuration Wizard, on the Component tab, click the ellipsis [...] icon next to PREVIEW.

b. On the Match Rule panel, define the matching keys.


To add three input keys, click the plus symbol [ + ] three times, then configure like this:
Input Key: Author; Matching Function: Jaro-Winkler; Confidence Weight: 80; Null Match ALL
Input Key: Format; Matching Function: Jaro-Winkler; Confidence Weight: 95; Null Match ALL
Input Key: Name; Matching Function: q-grams; Confidence Weight: 85; Null Match Null



c. Set the Match Threshold at 0.88.
This means records matching with a score lower than 0.88 are considered unique, while those matching with a score
of 0.88 or higher are grouped as potential duplicates.

d. Add a blocking column.

On the Blocking Selection panel, click the plus symbol [ + ] and select the Category column.
The goal of the blocking column is to create groups of records that limit the number of comparisons during
probabilistic matching.
In this case, you saw that the Category attribute contains very clean data and allows you to create smaller groups for
the matching computation.

e. Click OK.
f. To save the Job, press Ctrl+S.
7. DEFINE A SEPARATE OUTPUT FOR THE MATCHING COMPONENT
a. On the Component tab, click Advanced settings.

b. Select the Separate output check box.


By using this option, you separate potential duplicates into duplicate and suspect groups.
Set the Confident match threshold at 0.93.
With this configuration, records matching with a score of at least 0.88 (the previously defined match threshold) but
lower than 0.93 (the confident match threshold) are suspects, while records matching with a score of 0.93 or higher
are duplicates.
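
The effect of the two thresholds can be written as a small decision function. The Python sketch below uses a hypothetical classify function and assumes that a score equal to a threshold counts as reaching it, which is how the thresholds are described in the Concepts section.

```python
def classify(score: float,
             match_threshold: float = 0.88,
             confident_threshold: float = 0.93) -> str:
    """Route a match score to one of the three tMatchGroup outputs."""
    if score >= confident_threshold:
        return "matches"   # confident duplicates
    if score >= match_threshold:
        return "suspects"  # need a data steward's decision
    return "uniques"
```

Only records in the suspects category become tasks in Data Stewardship. The other two categories are handled entirely inside the Job.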



8. SAVE THE JOB
To save the Job, press Ctrl+S.
9. CONNECT COMPONENTS
a. Right-click the tMatchGroup component, select Row>Suspects, and connect to the tMap component.

b. The Job looks like this:

10. CONFIGURE THE MAPPING COMPONENT


a. To open the mapping wizard, double-click the tMap component.

b. To map the input fields to the createTasks output table, replace row1 with row2. Then map the matching-specific
fields:

TDS_GID: GID
TDS_MASTER: false



TDS_SOURCE: Source

c. To close the mapping wizard, click OK.


d. To save the Job, press Ctrl+S.

NOTE: The TDS_MASTER field must be set to false because no master record is generated for suspects. Recall that
when you use the Simple VSR algorithm, no golden record is calculated. Data Stewardship takes care of creating it by
following the survivorship rules defined in the merging campaign.

Handle unique records


You will now enrich the Job to write unique records to a separate file.
1. ADD AN OUTPUT COMPONENT
To write data to an output file, add a tFileOutputDelimited component.
a. Above the tMap component, add a tFileOutputDelimited component.

b. Connect it to the tMatchGroup component using a Uniques row.

2. CONFIGURE THE OUTPUT COMPONENT


Configure the tFileOutputDelimited component to write unique records to a separate file.
a. To open the Component tab, click the tFileOutputDelimited component.

b. Change the File Name to C:/Temp/uniques.csv



c. Select the Include Header check box.

3. SAVE THE JOB


To save the Job, press Ctrl+S.

Handle confident matches


You will now enrich the Job to handle confident matches.

Because the tMatchGroup component does not calculate a golden record for confident matches (remember that you used
the Simple VSR matching algorithm), you first need to calculate a golden record for the groups of confident matches. Then you will
write these golden records to an output file.
1. ADD A SURVIVE FIELDS COMPONENT
The tSurviveFields component receives a flow and merges it based on one or more columns.
a. Below the tMap component, add a tSurviveFields component.

b. Connect it to the tMatchGroup component using a Matches row.

2. ADD AN OUTPUT COMPONENT


In order to write the results of the matches to a separate output file, add another tFileOutputDelimited component.
a. Copy the tFileOutputDelimited_1 component and paste it to the right of the tSurviveFields component.

b. Connect it to the tSurviveFields component using a Main row and click Yes when asked if you want to get the
schema of the target component.



c. Verify the Job looks like this:

3. CONFIGURE THE SURVIVE FIELDS COMPONENT


The tSurviveFields component allows you to define the survivorship rule for each attribute while building a master record for
each group of confident matches.
a. To open the Component tab, click the tSurviveFields component.

b. Click the Sync columns button.


c. Add a Key column.

On the Key panel, click the plus symbol [+].


Enter Code in the Output column and GID (group ID) in the Input column.

d. Define the survivorship function for each output attribute.
On the Operations panel, click the double plus symbol.

e. In the Set Column Default Value wizard, set the default Function value to first.

f. Click OK.
The Operations section is updated. The survivorship function for all attributes is designated as first. You can define the
survivorship function differently for each attribute, but because the confident score was high, you can assume that the
confident matches are similar and the first record is acceptable.

g. To remove the matching-specific attributes, select them all and click the cross [ X ] button.

h. To save the updated configuration, press Ctrl+S.
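
What this configuration asks tSurviveFields to do can be sketched in Python as "keep the first row of each group, then drop the matching columns". The survive_first function is a hypothetical helper, and the column names in MATCHING_COLUMNS are assumed names for the matching-specific attributes; use the names that actually appear in your schema.

```python
# Assumed names of the matching-specific columns added by the matching step.
MATCHING_COLUMNS = {"GID", "GRP_SIZE", "MASTER", "SCORE", "GRP_QUALITY"}


def survive_first(rows: list, key: str = "GID") -> list:
    """Keep the first row of each group and drop matching-specific columns."""
    first_by_group = {}
    for row in rows:
        # setdefault keeps the first row seen for each group ID.
        first_by_group.setdefault(row[key], row)
    return [
        {col: val for col, val in row.items() if col not in MATCHING_COLUMNS}
        for row in first_by_group.values()
    ]
```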

4. CONFIGURE THE OUTPUT COMPONENT


Configure the second tFileOutputDelimited component to write master records built from confident matches to a separate
file.
a. To open the Component tab, click the tFileOutputDelimited_2 component.

b. Change the File Name to C:/Temp/matches.csv

c. Verify the Include Header check box is enabled.

5. SAVE THE JOB


To save the Job, press Ctrl+S.



Run the Job
1. RUN THE JOB
On the Run tab, click the Run button.

2. EXPLORE THE RESULTS


Notice that some records were marked as unique, some as suspects, and some as confident matches.
a. In the C:/Temp folder, use Notepad++ to open the uniques.csv file.
It contains 250 unique records.

b. In the same folder, use Notepad++ to open the matches.csv file.
The file contains 1,465 unique records (computed from 2,952 confident matches).

3. CONNECT TO DATA STEWARDSHIP
Connect as a business owner.
a. Enter these credentials:
Username: bowner@company.com
Password: talend
b. Click Log In.
The Campaigns page opens.
Confirm that 20 tasks (computed from 43 suspect records) were generated for the campaign.

c. Log out.

Next step
In the next section, you will resolve merging tasks.



Resolving Merging Tasks

Overview

Until this point in the merging campaign, you have played the role of campaign owner. You created the campaign and
generated tasks. You did not assign tasks to a data steward; they are all unassigned.
In this section, you will play the role of data steward, choosing and resolving unassigned tasks. You will also go through a
three-state validation workflow to understand the complete path followed by tasks.

Choose unassigned tasks


1. CONNECT TO DATA STEWARDSHIP
Connect as James Clark.
a. Enter these credentials:
Username: jclark@company.com
Password: talend
b. Click Log In.
The My Tasks page opens.

Notice that the LIBRARY PRODUCTS MERGE campaign has 20 unassigned tasks.

2. OPEN MERGING TASKS


a. To open the list of tasks, click the LIBRARY PRODUCTS MERGE campaign.

b. On the top menu, to open the drop-down menu, click Assigned to me.

c. To see the tasks, click Unassigned.

d. To sort the tasks in ascending order, click the arrows in the Code column.

3. CHOOSE UNASSIGNED TASKS


a. Select the first 10 tasks and assign them to Me (James Clark).
Remember, in order to select them all at once, you need to select the first task, keep the Shift key pressed, and select
the tenth row.



b. Next to the corresponding data steward, click the Assign tasks button.

c. Notice that only 10 tasks are unassigned, and 10 are assigned to the logged-in data steward.

4. LOG OUT
In the upper right corner, click Logout.

5. LOG IN TO DATA STEWARDSHIP


Log in as Dave Smith using these credentials:
Username: dsmith@company.com
Password: talend
6. CHOOSE UNASSIGNED TASKS
As you did earlier, open the unassigned tasks and assign them to yourself.
a. To open the remaining unassigned tasks, on the menu at the top, select Unassigned.



b. Assign the remaining unassigned tasks to yourself.
Notice that there are no more unassigned tasks, and you can see 10 assigned tasks under Me.

c. Stay logged in.


In the next section, data steward Dave Smith will start resolving his merging tasks.

Resolve merging tasks


1. VIEW YOUR TASKS
Display the tasks assigned to you.
a. On the menu at the top, click the Unassigned drop-down menu.

b. To see the tasks, click ASSIGNED TO ME.

2. OPEN THE MERGING TASK DETAILS

a. Next to the third task (code 950), click the arrow. The task details open and you can see the records that were
grouped as potential duplicates, as well as the golden record that Data Stewardship computed by following the
survivorship rules established when you defined the campaign.

b. These two records definitely look like duplicates; one has the movie name misspelled.
Verify that the survivorship rules were applied when building the golden record:
The Popularity and Availability attributes were retained from the SAP source, while the price was retained from the
legacy product catalog source.
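
The survivorship check above amounts to choosing a source for each attribute. The following Java sketch illustrates that idea under stated assumptions: the SourceRecord type, the source labels "SAP" and "LEGACY", and the sample values are hypothetical, and the code does not reflect how Data Stewardship applies its rules internally.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of attribute-level survivorship; not Talend's implementation.
class SurvivorshipSketch {

    // Illustrative source record: its origin plus the three attributes discussed in the text.
    static final class SourceRecord {
        final String source;
        final String popularity;
        final String availability;
        final String price;

        SourceRecord(String source, String popularity, String availability, String price) {
            this.source = source;
            this.popularity = popularity;
            this.availability = availability;
            this.price = price;
        }
    }

    // Returns the first record that comes from the given source, or null if there is none.
    static SourceRecord fromSource(List<SourceRecord> group, String source) {
        for (SourceRecord r : group) {
            if (r.source.equals(source)) {
                return r;
            }
        }
        return null;
    }

    // Popularity and Availability survive from SAP; Price survives from the legacy catalog.
    static Map<String, String> goldenRecord(List<SourceRecord> group) {
        SourceRecord sap = fromSource(group, "SAP");
        SourceRecord legacy = fromSource(group, "LEGACY");
        Map<String, String> golden = new LinkedHashMap<>();
        golden.put("Popularity", sap == null ? null : sap.popularity);
        golden.put("Availability", sap == null ? null : sap.availability);
        golden.put("Price", legacy == null ? null : legacy.price);
        return golden;
    }

    public static void main(String[] args) {
        // Sample values are made up for the illustration.
        List<SourceRecord> group = Arrays.asList(
            new SourceRecord("SAP", "8", "true", "12.50"),
            new SourceRecord("LEGACY", "5", "false", "9.99"));
        System.out.println(goldenRecord(group)); // {Popularity=8, Availability=true, Price=9.99}
    }
}
```

In the application, you can still override any surviving value by hand, as the later steps on merged tasks show.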

3. MARK A MERGING TASK AS READY FOR VALIDATION


To mark this task as ready for validation, in front of the task, click the white lock on the gray background. The task turns green
and the next task opens.

4. MERGE TASKS



Notice that task 4 (code 1113) and 5 (code 1114) are both related to the movie Shrek, so you need to merge them.
a. Explore the details for tasks 4 and 5.
Notice that there is a misspelling in the movie name and the author name is not always complete, but the four records
correspond to the same product.

b. Select the two tasks.

c. On the right side of the interface, click the Row tab, and on the menu, choose Merge the selected tasks.

d. The merged task looks like this:

5. BUILD THE GOLDEN RECORD FOR MERGED TASKS


You merged two tasks, and now you have a task with four records. You may want to review the golden record computed to
make sure it corresponds to the master record you will keep.



a. On the golden record, click the Name column and change it to Shrek.

b. On the third record, click the Author column, then click the blue arrow to use that value as the author of the movie in the
golden record.

c. Use the same process to move the Year value from the second record to the golden record.

d. Use the same process to move the Sub-Category value from the fourth record to the golden record.

e. Click the white lock icon to mark the task as ready for validation.

f. The task is marked as ready for validation and the next task opens.

6. SPLIT TASKS
Notice that task 5 (code 1322) is a false duplicate. This means the task needs to be split.



a. View the name of the movie in task 5 with code 1322.

b. Split the task.


Select the first record in the task. On the menu on the right, open the Row tab and click Split the task.

c. The task is split.

d. Open each task and click its lock to mark it as ready for validation.

Validate tasks
1. VALIDATE THE CHOICES
You marked four tasks as ready for validation, but they are not validated.
a. In the upper right corner, click the VALIDATE CHOICES button.

b. Notice that the number of tasks for Dave Smith has decreased from 10 to 6. This means the validated tasks have
passed from the New state to the To validate state (as defined in the three-state workflow configured for the merging
campaign).



2. OPEN TASKS THAT NEED TO BE VALIDATED
Recall from the definition of the campaign that the data stewards have access to tasks in the New state, while the business
owner has access to tasks in the To validate and Resolved states.
a. Log out, then use these credentials to log in as the business owner:
Username: bowner@company.com
Password: talend
b. Click Log In.
c. To open the tasks associated with this campaign, click LIBRARY PRODUCTS MERGE. 

3. DISPLAY THE TRANSITION COLUMN

a. Click the first column header.

b. Enable the Transition checkbox.

4. REJECT ONE TASK
Although both records were identical for this movie, the reference from SAP was actually an extended version. Therefore,
the campaign owner decided to reject it.
a. By default, task 1 is selected, and you can choose to either accept or reject the work of the data steward on this task.
Click the REJECT button.



b. The task is marked ready for validation with the transition value Reject and the next task is opened.

NOTE:
After validating a rejected task, the task goes back to the Unassigned state.

5. OPEN THE HISTORY OF A TASK


Open the history of the Shrek task.

a. Hover your mouse over the Shrek task and click the clock icon.

b. The task history opens.

View the details and verify that the merge action appears on the list.



c. Note that by default, events are sorted from newest to oldest. To sort them from oldest to newest, use the drop-down menu.
To close the task history, in the upper right corner, click the cross [ X ] button.

6. ACCEPT CHANGES TO TASKS


Approve changes for the three remaining tasks.
a. The second task is still selected. To mark it as ready for validation, click ACCEPT.

b. The task is marked as ready for validation, the transition value is designated as ACCEPT, and the next task is opened.

c. Use the same process to accept changes for the other two tasks.

After validating the tasks, the accepted tasks go to the Resolved state.
7. VALIDATE CHOICES
You have marked four tasks as ready for validation, but they are not complete.
When you validate tasks, the rejected ones go back to the Unassigned state while the accepted ones go to the Resolved
state.
In the upper right corner, click VALIDATE CHOICES.
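
The path a task follows through the three-state workflow can be summarized as a small state machine. The Java sketch below only illustrates the transitions described in this section; the enum and method names are hypothetical, and nothing here calls Data Stewardship itself.

```java
// Hypothetical sketch of the three-state workflow described in this section.
class WorkflowSketch {

    enum State { NEW, TO_VALIDATE, RESOLVED }

    enum Transition { ACCEPT, REJECT }

    // A data steward validates a choice on a task in the New state: it moves to To validate.
    static State stewardValidates(State current) {
        if (current != State.NEW) {
            throw new IllegalStateException("Stewards work on tasks in the New state");
        }
        return State.TO_VALIDATE;
    }

    // The business owner validates a decision on a task in the To validate state.
    // Accepted tasks are resolved; rejected tasks return to New, where the text says
    // they show up as unassigned again.
    static State ownerValidates(State current, Transition choice) {
        if (current != State.TO_VALIDATE) {
            throw new IllegalStateException("Owner decisions apply to tasks in the To validate state");
        }
        return choice == Transition.ACCEPT ? State.RESOLVED : State.NEW;
    }

    public static void main(String[] args) {
        State afterSteward = stewardValidates(State.NEW);
        System.out.println(afterSteward);                                    // TO_VALIDATE
        System.out.println(ownerValidates(afterSteward, Transition.ACCEPT)); // RESOLVED
        System.out.println(ownerValidates(afterSteward, Transition.REJECT)); // NEW
    }
}
```

In this exercise, three tasks follow the ACCEPT path and one follows the REJECT path.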



8. DISPLAY TASKS IN THE RESOLVED STATE
a. On the menu at the top, next to TO VALIDATE, click the down arrow.
The list of available states is displayed.

b. Click Resolved.
c. The three accepted tasks are displayed.

9. DISPLAY THE REJECTED TASK


The rejected task is now in the New state and can only be seen on the Campaigns menu.

a. On the menu at the top, click the arrow next to Tasks.

b. Click Campaigns.
By default, new tasks are displayed.
Confirm that the rejected task appears on the list of tasks in the New state.

Next step
Now you will query resolved tasks in order to retrieve golden records validated in Data Stewardship.



Querying Tasks

Overview

In this section, you will use Talend Studio to query resolved tasks from the merging campaign. You will create a Job that
extracts golden records from the resolved tasks validated in Talend Data Stewardship and writes them to an output file.
This diagram shows how it works:

Building a Job that queries resolved tasks


You are ready to add a Job that reads golden records from the resolved tasks validated in Data Stewardship and writes them to an
output file.
1. CREATE A STANDARD JOB
Create a standard Job called QueryMergingTasks.
a. In the Repository, expand Job Designs.
Right-click Standard and select Create Standard Job.
b. The New Job window opens.
In the Name text box, enter QueryMergingTasks
In the Purpose text box, enter Read resolved tasks for merging campaign
Enter a comment of your choice in the Description text box.

c. Click Finish.
The Job opens in the work area.

2. ADD AN INPUT COMPONENT



Add a tDataStewardshipTaskInput component to the Job.

3. CONFIGURE THE INPUT COMPONENT

a. Click the tDataStewardshipTaskInput component to open the Component tab.

b. In the Property Type box, select Repository.


To select metadata, click the ellipsis [...] button.

c. Select datastewardship:TDS_Connection and click OK.


The URL, Username, and Password text boxes are filled in.



d. Click the Find a campaign button.

e. Select Library Products Merge and click OK.


f. Verify that Campaign, Label, and Type boxes are filled in.

g. Notice that the Retrieve golden record only check box is selected.

4. DEFINE ADVANCED SETTINGS FOR THE INPUT COMPONENT


Configure the Job to consume tasks when they are read.
a. On the menu on the left, click Advanced settings.

b. To avoid task persistence, select the Consume tasks when they are read check box.



c. To save the configuration, press Ctrl+S.
5. ADD AND CONNECT A MAPPING COMPONENT
Add a tMap component and connect it to the tDataStewardshipTaskInput component using a Main row.

6. CONFIGURE THE MAPPING


Define an output table in the mapping component.
a. To open the mapping editor, double-click the tMap component.
b. Add an output table named GoldenRecords.

c. Click OK and map the first 11 fields to the output table.
Press SHIFT and select all 11 attributes at once.

d. To close the mapping window, click OK.


e. To save the component configuration, press Ctrl+S.
7. ADD AND CONNECT AN OUTPUT COMPONENT
Add an output component that writes golden records from resolved tasks to a separate file.
a. Add a tFileOutputDelimited component.



b. Connect the tMap component to the tFileOutputDelimited component using a GoldenRecords row.

8. CONFIGURE THE OUTPUT


In the output component, define the path for the output file.
a. To open the Component tab, click the tFileOutputDelimited component.

b. Change the file name to C:/Temp/suspects_resolved.csv

c. Select the Include Header check box.

9. SAVE THE JOB


To save the Job, press Ctrl+S.
10. TEST THE JOB



a. In the Run view, click the Run button.

b. Confirm that three rows from the merging campaign were read. 
These correspond to the three tasks that were in the Resolved state.
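
The data flow of the QueryMergingTasks Job can be pictured as: keep the resolved tasks, take their golden records, and write them as delimited lines under a header. The Java sketch below is a rough illustration of that flow only. The Task type, the column names, the sample values, and the ";" separator are assumptions for the example, and the code does not use any Talend component or API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the QueryMergingTasks data flow; it does not call Talend APIs.
class QuerySketch {

    // Illustrative task: a workflow state plus the golden record's field values.
    static final class Task {
        final String state;
        final List<String> goldenRecord;

        Task(String state, List<String> goldenRecord) {
            this.state = state;
            this.goldenRecord = goldenRecord;
        }
    }

    // Keeps only tasks in the Resolved state and renders their golden records as
    // delimited lines, preceded by a header line (the Include Header option).
    static List<String> toDelimitedLines(List<Task> tasks, List<String> header, String separator) {
        List<String> lines = new ArrayList<>();
        lines.add(String.join(separator, header));
        for (Task t : tasks) {
            if ("Resolved".equals(t.state)) {
                lines.add(String.join(separator, t.goldenRecord));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample tasks with made-up values; only the Resolved one is written.
        List<Task> tasks = Arrays.asList(
            new Task("Resolved", Arrays.asList("Title A", "2001")),
            new Task("New", Arrays.asList("Title B", "1999")));
        for (String line : toDelimitedLines(tasks, Arrays.asList("Name", "Year"), ";")) {
            System.out.println(line);
        }
    }
}
```

With three resolved tasks, as in this exercise, the result would be a header line followed by three data lines.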

Verify resolved tasks in the output file


When the previous Job ran, the suspect records resolved in Data Stewardship were written to the suspects_resolved.csv file.
1. OPEN THE OUTPUT FILE
a. In Windows Explorer, navigate to the C:/Temp folder.

b. Right-click the suspects_resolved.csv file and select Edit with Notepad++.

2. EXPLORE THE OUTPUT FILE
The output file is open. Notice that three rows correspond to the three golden records you validated in Data Stewardship.

In this lesson, you wrote three types of output from the tMatchGroup component—unique, matching, and resolved suspect—to three
output files. In an actual project, you would write these records to a product referential database.

Next step
You have almost finished this section. Time for a quick review.



Review
In this lesson, you created a merging campaign. You learned how to handle a deduplication scenario: a company that stores product
items in several internal systems must define a unique list of the items.
You learned how to:
Create a merging campaign and define survivorship rules in order to build golden records
Create merging tasks by using the tMatchGroup component in Talend Studio
Resolve merging tasks
Query resolved tasks for a campaign in order to retrieve validated golden records and write them to an output file
While using the tMatchGroup component, you completed its configuration, defining:
A blocking key
Several matching keys used to compute the overall matching score between records
A match threshold and confident match threshold. The component generates three types of output:
Unique—the calculated score is lower than the match threshold
Suspect—the calculated score is higher than the match threshold but lower than the confident match threshold
Match—the calculated score is higher than the confident match threshold
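
The three output types can be read as a simple comparison of the score against the two thresholds. The Java sketch below illustrates that rule. The threshold values are sample numbers, the names are hypothetical, and the handling of a score exactly equal to a threshold is an assumption, because the text does not specify it. The code is not the tMatchGroup implementation.

```java
// Hypothetical sketch of the three-way classification; thresholds are sample values.
class MatchClassifierSketch {

    enum Outcome { UNIQUE, SUSPECT, MATCH }

    // Classifies a matching score against the two thresholds.
    // Assumption: a score exactly equal to a threshold counts as having reached it.
    static Outcome classify(double score, double matchThreshold, double confidentMatchThreshold) {
        if (score < matchThreshold) {
            return Outcome.UNIQUE;
        }
        if (score < confidentMatchThreshold) {
            return Outcome.SUSPECT;
        }
        return Outcome.MATCH;
    }

    public static void main(String[] args) {
        double matchThreshold = 0.6;          // sample value
        double confidentMatchThreshold = 0.9; // sample value
        System.out.println(classify(0.40, matchThreshold, confidentMatchThreshold)); // UNIQUE
        System.out.println(classify(0.75, matchThreshold, confidentMatchThreshold)); // SUSPECT
        System.out.println(classify(0.95, matchThreshold, confidentMatchThreshold)); // MATCH
    }
}
```

Only the SUSPECT outcome produces merging tasks for data stewards; the other two outcomes were written directly to files in this lesson.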

Additional information
Talend documentation:
Data matching with Talend tools
Adding a merging campaign

LESSON 7
Arbitration Campaigns
This chapter discusses:

Concepts
Overview
Creating an Arbitration Campaign
Creating Arbitration Tasks
Handling Arbitration Tasks
Review
Concepts

A merging campaign enables you to decide which fields to use to determine a match between records
and merge the records. Once data is deduplicated, the unique list of product items can be uploaded to any
product referential system.

Overview
In the previous lessons, you created a resolution, a grouping and a merging campaign. You generated tasks for them, and learned
how to handle tasks as a data steward.
In the next scenario, the sales team needs to apply discounts to library product items.
In Talend Data Stewardship, you will create an arbitration campaign that lets data stewards designate the correct discount value for
each product.

Objectives
After completing this lesson, you will be able to:
Define an arbitration campaign
Build a Job that creates tasks for the campaign
Assign tasks
Handle tasks
Reopen resolved tasks

Next step
You are ready to define an arbitration campaign.

Creating an Arbitration Campaign

Overview

In this section, you will create an arbitration campaign that allows data stewards to specify the right discount for each
product.

Create an arbitration campaign


1. ADD A CAMPAIGN
Log in as the business campaign owner.
a. Enter these credentials:
Username: bowner@company.com
Password: talend
b. On the Campaigns page, click the ADD CAMPAIGN button.
The campaign form appears.

2. DEFINE THE GENERAL CHARACTERISTICS


Enter a name, description, and type for the campaign.
a. Enter these parameters:
Name: LIBRARY PRODUCTS ARBITRATION
Description: Establish the discount value for each library item



b. Keep the ARBITRATION type selected.

c. In the QUESTION box, enter What discount rate should be used?


In the ANSWERS boxes, enter 5%, 10%, and 20%.

3. IDENTIFY CAMPAIGN OWNERS


The person who created the campaign is already designated as the owner.
Now you are adding a second campaign owner—a technical one—who will use Talend Studio to add campaign tasks.

Click in the Add a campaign owner box and select towner@company.com.

4. DEFINE CAMPAIGN ROLES


Define the data steward role for the campaign.
a. Next to STEWARDS, click the Edit icon, change the role to DATA STEWARD (uppercase), and save by clicking the
check icon.

b. Click the Add a steward box and select James Clark and Dave Smith.

5. DEFINE THE DATA MODEL FOR ARBITRATION TASKS



Click the Select a data model box and select Library Products.
The list of attributes is displayed.

6. DEFINE THE CAMPAIGN WORKFLOW


Define the steps for resolving campaign tasks.
a. Retain the simple default workflow.

b. To grant data stewards access to tasks in the New state, below New, click in the Add a role box and select
DATA STEWARD.

c. To grant data stewards access to tasks in the Resolved state, below Resolved, click in the Add a role box and select
DATA STEWARD.

7. COMPLETE CAMPAIGN CREATION


Save the campaign.
a. To save, in the lower right corner, click the ADD CAMPAIGN button.



b. The campaign appears on the list.

Next step
In the next section, you will create arbitration tasks in Talend Studio.

Creating Arbitration Tasks

Overview

In this section, you will use Talend Studio to generate tasks for the arbitration campaign you created.
You will create a Job that reads a file containing library product data and creates an arbitration task for each record in that
file.
When data stewards need to handle a high volume of tasks, it is important to define task priorities. You will now assign
tasks to data stewards based on item categories: movie-related tasks go to James Clark, while the rest go to Dave Smith.
This diagram shows how it works:

Build a Job that creates arbitration tasks


1. DUPLICATE A JOB
a. Go to Talend Studio.
In the Repository, right-click the Job CreateResolutionTasks and select Duplicate.



b. In the Input new name text box, enter CreateArbitrationTasks

c. Click OK.
Double-click the Job to open it in the work area.

2. VERIFY THE INPUT COMPONENT CONFIGURATION


a. To open the Component tab, double-click the Library_Products input component.

b. You do not need to update the configuration.


In this exercise, you will use the Library_data.csv input file.

3. UPDATE THE OUTPUT TRANSITION NAME


Update the output transition name.
a. To activate editing mode, double-click the createTasks row.



b. Change the name to createMovieTasks and press ENTER.

4. CONFIGURE THE OUTPUT COMPONENT


Update the configuration to create movie tasks for the arbitration campaign and assign them to James Clark.
a. To open the Component tab, double-click the tDataStewardshipTaskOutput component.

b. Click the Find a campaign button and select LIBRARY PRODUCTS ARBITRATION.

c. Click the Sync columns button.

d. Select the Override enforcement of data model check box.

e. In the Assignee box, select jclark@company.com.

5. ADD AND CONNECT A SECOND OUTPUT COMPONENT


Duplicate the first output component and connect it to the mapping component using a Main row.
a. Copy tDataStewardshipTaskOutput_1 and paste the duplicate below the first output component.



b. Connect the mapping and second output component using an output row named createOtherTasks.

c. Click OK. When asked if you want to get the schema of the target component, click Yes.

6. CONFIGURE THE SECOND OUTPUT COMPONENT


Update the configuration to assign other arbitration tasks to Dave Smith.

a. To open the Component tab, double-click the tDataStewardshipTaskOutput_2 component.

b. Click the Find a campaign button and select LIBRARY PRODUCTS ARBITRATION.

In the Assignee box, select dsmith@company.com.

c. Click the Sync columns button.


d. To save the configuration, press Ctrl+S.
7. CONFIGURE MAPPING FOR THE FIRST OUTPUT TABLE



This time, the mapping has two output tables: one for creating tasks for the movie category and one for creating tasks for
other categories.
To retrieve only tasks related to movies, apply a filter to the first output table.
a. To open the Mapping table, double-click the tMap component.

b. Click the Enable/disable expression filter icon (the arrow with a plus sign).

In the expression filter box, enter "Movies".equals(row1.Category)

c. To leave room for the second output table, in the first output table, click the Minimize icon.

8. CONFIGURE MAPPING FOR THE SECOND OUTPUT


You are ready to define the mapping and filter for the second output.

a. Select the second output table and click Auto Map!.

b. Manually map the following fields:


Author>AUTHOR_DIRECTOR
Type>SUBCATEGORY
Ranking>POPULARITY

c. Click the tMapSettings icon (the wrench).



d. For the Catch output reject value, select true.

e. To close the mapping table, click OK.


9. SAVE THE JOB
To save the Job, press Ctrl+S.
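
The mapping configured above sends each input row to one of two outputs: rows that pass the "Movies".equals(row1.Category) filter go to createMovieTasks, and, because the second table catches output rejects, the remaining rows go to createOtherTasks. The Java sketch below imitates that routing with plain lists. The Row type and the sample rows are hypothetical, and this is not the code that Talend Studio generates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the two-output routing; not code generated by Talend Studio.
class RoutingSketch {

    // Illustrative input row with only the fields needed here.
    static final class Row {
        final String name;
        final String category;

        Row(String name, String category) {
            this.name = name;
            this.category = category;
        }
    }

    // Same test as the expression filter. Putting the literal first means a null
    // category yields false instead of throwing a NullPointerException.
    static boolean isMovie(Row row1) {
        return "Movies".equals(row1.category);
    }

    // Rows that pass the filter go to the first output; all others go to the second.
    static void route(List<Row> input, List<Row> createMovieTasks, List<Row> createOtherTasks) {
        for (Row row1 : input) {
            if (isMovie(row1)) {
                createMovieTasks.add(row1);
            } else {
                createOtherTasks.add(row1);
            }
        }
    }

    public static void main(String[] args) {
        List<Row> input = Arrays.asList(
            new Row("Item A", "Movies"),
            new Row("Item B", "Books"),
            new Row("Item C", null));
        List<Row> movies = new ArrayList<>();
        List<Row> others = new ArrayList<>();
        route(input, movies, others);
        // prints: 1 movie task(s), 2 other task(s)
        System.out.println(movies.size() + " movie task(s), " + others.size() + " other task(s)");
    }
}
```

Note that the comparison is case-sensitive, so a category written as "movies" would be routed to the second output.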

Run the Job


1. RUN THE JOB
In the Run view, click the Run button.

2. EXPLORE THE RESULTS
Notice that 1,262 movie-related tasks were assigned to James Clark, while the rest (436) were assigned to Dave Smith.
As the business owner, connect to Data Stewardship and from the Campaigns menu, click Library Products Arbitration.

Set task priorities


When the business owner sees that James Clark has 1,262 tasks, she realizes she needs to help him prioritize them. The default
priority for tasks is medium. The business owner feels that tasks corresponding to movies available in DVD format should be handled
before the others, so she will change their priority from medium to high.

1. FILTER RECORDS 
a. Logged in as the business owner, to display tasks assigned to James Clark, on the Assign tasks to: menu, click
jclark@company.com.



Apply a filter to display only available items.
b. Select the Availability column. In the chart on the right, select true.

c. To display only movies available in DVD format, apply a second filter.
Select the Format column, and in the chart on the right, select DVD.

2. DEFINE PRIORITY
a. On the menu on the right, click the Row tab.

b. Click Define priority.



c. In the Selection box, select Filtered tasks.
In the Priority box, select High.

d. Click Submit.
Verify the priority for the filtered records.

e. Log out.
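
The two filters followed by Define priority on Filtered tasks behave like a bulk update on the tasks that satisfy both conditions. The Java sketch below illustrates that logic. The Task type and the sample data (including the "Blu-ray" format) are hypothetical, and the code does not interact with Data Stewardship.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the filter-then-prioritize step; not a Data Stewardship API.
class PrioritySketch {

    enum Priority { MEDIUM, HIGH }

    // Illustrative task with the two filtered attributes and a priority.
    static final class Task {
        final boolean available;
        final String format;
        Priority priority = Priority.MEDIUM; // medium is the default, per the text

        Task(boolean available, String format) {
            this.available = available;
            this.format = format;
        }
    }

    // Raises the priority of tasks that pass both filters (Availability = true, Format = DVD)
    // and returns how many tasks were changed.
    static int prioritizeAvailableDvds(List<Task> tasks) {
        int changed = 0;
        for (Task t : tasks) {
            if (t.available && "DVD".equals(t.format)) {
                t.priority = Priority.HIGH;
                changed++;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Task> tasks = Arrays.asList(
            new Task(true, "DVD"),
            new Task(false, "DVD"),
            new Task(true, "Blu-ray"));
        System.out.println(prioritizeAvailableDvds(tasks) + " task(s) set to HIGH"); // 1 task(s) set to HIGH
    }
}
```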

Next step
In the next section, you will handle several arbitration tasks.

Handling Arbitration Tasks

Overview

You created an arbitration campaign and generated tasks for it. Now you will act as a data steward and handle some
arbitration tasks. Finally, you will explore resolved tasks and learn how to reopen one. This functionality would be helpful, for
example, in a scenario in which the discount rate is wrong and the task must be reviewed.

Handle arbitration tasks


1. CONNECT TO TALEND DATA STEWARDSHIP
Log out and log back in as data steward Dave Smith.
a. Enter these credentials:
Username: dsmith@company.com
Password: talend
b. Click Sign In.
The Tasks page appears.

Notice the number of tasks in the LIBRARY PRODUCTS ARBITRATION campaign.

2. OPEN THE TASK LIST



To display the task list, click LIBRARY PRODUCTS ARBITRATION.

3. MARK A TASK AS READY FOR VALIDATION


By default, the first task is selected.
In answer to the question about the discount rate, click 10%.
The ARBITRATION value is 10%. The task is green and ready for validation.

4. FILTER THE RECORDS


You want to apply a discount rate to all books that have a popularity value of 6.
You can filter the records using the functions on the right.
a. To display only items with a popularity value of 6, select the POPULARITY column, and in the chart on the right, select 6.

b. To display only books, apply a second filter.
Select the CATEGORY column, and on the right, select Books.

5. APPLY THE SAME DISCOUNT TO FILTERED RECORDS


You now need to apply the same discount, 5%, to the filtered items.



a. Select the 11 tasks: click the first one, press and hold SHIFT, and select the last one.

b. Confirm that you selected 11 items.
To apply the discount to the selected items, click 5%.

c. The ARBITRATION values are designated as 5%, and the tasks are green and ready for validation.
6. VALIDATE THE TASKS
Validate the tasks.



a. In the upper right corner, click the VALIDATE CHOICES button.

b. Notice that the number of tasks for Dave Smith has decreased from 436 to 424.
This means the 12 tasks moved from the New to the Resolved state.

7. DISPLAY TASKS IN THE RESOLVED STATE


Remember that data stewards can access tasks in both the New and Resolved states.

a. On the top menu, click the arrow next to New and select Resolved.

b. There are 12 tasks: one shows the arbitration value at 10% and the others show it at 5%.

8. REOPEN A RESOLVED TASK
Examining the resolved tasks, the data steward realizes that the arbitration value for the first task is incorrect. He would like
to reopen and update the task.



a. Select the first task. On the menu on the right, click the ROW tab.

b. Click Reopen tasks.

c. To reopen only the selected task, click the SUBMIT button.


The task returns to the New state, and the number of resolved tasks decreases from 12 to 11.

9. DISPLAY NEW TASKS
The reopened task appears on the list of unassigned tasks in the New state.

a. On the top menu, next to Resolved, click the arrow.

b. Click the New state.


c. Click Assigned to me, then select Unassigned.
The reopened task appears.

Next step
You have almost finished this section. Time for a quick review.



Review
In this lesson, you created an arbitration campaign and learned how to handle a scenario in which data stewards must answer a
question to define a new parameter for a task.
You learned how to:
Create and assign arbitration tasks in Talend Studio
Define task priority
Resolve tasks in Data Stewardship
Reopen a resolved task

Additional information
Talend knowledge base:
Adding an arbitration campaign
