Exploring
You will be working in actual Talend software, not a simulation. We hope you have fun and get lots of practice using the software!
However, if you work on tasks beyond the scope of the training, you could run out of time with the environment, or you could mess up
data or Jobs needed for subsequent exercises. We suggest finishing the course first, and if you have remaining time, explore as you
wish. Keep in mind that our technical support team can’t assist with your exploring beyond the course materials.
Sharing
This course is provided for your personal use under an agreement with Talend. You may not take screenshots or redistribute the
content or software.
Concepts
LESSON 1 | 9
Because Data Stewardship is fully integrated with the
Talend platform, it can be associated with any data
flow and integration style that Talend can manage, so
you can embed governance and stewardship in data
integration flows, MDM initiatives, and matching
processes.
Concepts
Overview
Exploring the Environment
Creating Users in TAC
Connecting to Talend Data Stewardship
Review
Concepts
Objectives
After completing this lesson, you will be able to:
Describe the Data Stewardship architecture
Create Data Stewardship users in TAC
Connect to the Data Stewardship web UI
Next step
You are ready to start exploring the environment.
LESSON 2 | 15
Exploring the Environment
Overview
In this section, you open the configuration file for the Talend Data Stewardship server and check the configuration of all
related components. Then you confirm that all the application services you need are running.
a. The Talend IDP section describes the Single Sign-On properties for Talend Data Stewardship.
b. MongoDB settings are also configured in this configuration file. MongoDB is a prerequisite for the Data Stewardship
server, as it is used to store metadata for campaigns and tasks. In your training environment, MongoDB is installed
and running, and a MongoDB user has been created with required credentials.
c. Kafka settings are also configured in this file, as they ensure service messaging in the Data Stewardship server. Kafka
topics are also used for the messaging between Data Stewardship and the dictionary service.
2. CONFIRM THAT TALEND ZOOKEEPER IS RUNNING
Search for Talend Zookeeper (a prerequisite for Talend Kafka service) and verify that the status is Running.
3. CONFIRM THAT TALEND KAFKA IS RUNNING
Search for Talend Kafka and verify that the status is Running.
4. CONFIRM THAT TALEND MONGODB IS RUNNING
Search for Talend MongoDB and verify that the status is Running.
5. CONFIRM THAT TAC IS RUNNING
Search for Talend Administration Center 6.4.1.
If the service is not running, click the Start button and wait until the status is Running.
6. CONFIRM THAT DICTIONARY SERVICE IS RUNNING
Next step
You are ready to connect to TAC and create users.
Creating Users in TAC
Overview
Click Talend Administration Center.
Create campaign owners in TAC
1. OPEN THE USERS MENU
On the Menu, in Settings, click Users.
2. ADD A USER
f. Verify that the user details appear as in the screenshot:
d. Select the Data Stewardship User check box.
e. For Data Stewardship Role, select Data Steward and Campaign Owner, then click Validate.
To add the user to the list, click Save.
h. Click Save.
2. ADD A SECOND DATA STEWARD USER
Add another user who has only data steward permissions.
a. On the Users tab, click Add.
b. For the data steward user, enter the following credentials:
Login: dsmith@company.com
First Name: Dave
Last Name: Smith
Password: talend
c. Select the Data Stewardship User check box.
d. For Type, select No Project Access.
e. For Data Stewardship Role, select Data Steward.
f. Make sure the Active check box is selected.
g. Verify that the user details appear as follows:
Next step
Now you can start the Data Stewardship server and test your connection.
Connecting to Talend Data Stewardship
Overview
You are ready to start the Talend Data Stewardship server and test your connection as a campaign owner and data
steward. You can also explore the menus available to those in different roles.
c. Notice that two menus are available for a user with campaign owner permissions: Campaigns and Data Models.
3. LOG OUT OF THE DATA STEWARDSHIP WEB UI
On the menu in the upper right corner, click the Logout button.
c. Notice that four menus are available to a user with both campaign owner and data steward permissions. Campaigns,
Data Models, and Semantic Types are available only to campaign owners, while the Tasks menu is available to data
stewards.
5. LOG OUT OF THE WEB UI
On the menu in the upper right corner, click Logout.
Password: talend
b. Click SIGN IN.
This screen appears:
c. Notice that just one menu is available for a user with only data steward permissions.
7. LOG OUT OF THE WEB UI
Next step
You have almost finished this section. Time for a quick review.
Review
You began this lesson by learning about the architecture of the Talend Data Stewardship application, and you explored the
configuration file. Then you connected to TAC and created two campaign owner users and two data steward users. You learned that
Data Stewardship user roles and permissions are handled in TAC.
Finally, you started the Data Stewardship server and tested your connection using the Web UI.
Additional information
Talend documentation:
About Talend Data Stewardship
Concepts
Overview
Creating a Data Model
Creating a Campaign
Review
Concepts
LESSON 3 | 47
In the exercises in this lesson, you will build a data
model and create a resolution campaign.
Objectives
After completing this lesson, you will be able to:
Create a data model
Add semantic types
Define a resolution campaign
Next step
You are ready to create a data model.
Creating a Data Model
Overview
In this course, you will work on a data model that stores movies, books, video games, and records with descriptive details
and prices.
In this section, the first data model you will create defines attributes that have data types and specific constraints.
From TDS, you will be able to use the standard data types (for example, integer and string), as well as a set of semantic
types provided by the dictionary service. You will also learn how to add your own semantic types to the dictionary service
so you can reuse them in data models as needed.
b. Change the data type to Integer.
c. The ALLOW EMPTY VALUES slide button indicates whether the library item identifier is optional. In this case the
identifier is mandatory, so deactivate the button.
d. Observe that the attribute name and type appear on the left.
8. ADD AN OPTIONAL ITEM SUBCATEGORY ATTRIBUTE
Add an attribute for storing the library item subcategory (for example, science fiction or drama).
Define it like this:
Identifier: SUBCATEGORY
Name: Sub-Category
Data type: Text
Leave the ALLOW EMPTY VALUES slide button activated.
10. ADD AN OPTIONAL PRICE ATTRIBUTE
Add an attribute for storing the price of the library item. The data type for this field is a decimal number with two digits after the decimal point.
Define it like this:
Identifier: PRICE
Name: Price
Data type: Decimal
Leave the ALLOW EMPTY VALUES slide button activated.
12. ADD AN OPTIONAL AVAILABILITY ATTRIBUTE
Add an attribute for storing the availability of each library item. The data type for this field is boolean; only values true and
false are accepted.
Define it like this:
Identifier: AVAILABILITY
Name: Availability
Data type: Boolean
Leave the ALLOW EMPTY VALUES slide button activated.
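To make the attribute settings more concrete, the three optional attributes defined in the preceding steps can be pictured as a small schema with a validation check. This is only an illustrative sketch in Python, not how Talend Data Stewardship stores or enforces data models: the Attribute class and the is_valid and invalid_fields helpers are hypothetical names, and the rules shown (any non-empty string for Text, a parseable number for Decimal, only true or false for Boolean) are simplifying assumptions based on the descriptions above.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class Attribute:
    """Hypothetical stand-in for one data model attribute."""
    identifier: str
    name: str
    data_type: str            # "Text", "Decimal", or "Boolean" in this sketch
    allow_empty: bool = True  # mirrors the ALLOW EMPTY VALUES slide button


# The three optional attributes defined in the steps above.
ATTRIBUTES = [
    Attribute("SUBCATEGORY", "Sub-Category", "Text"),
    Attribute("PRICE", "Price", "Decimal"),
    Attribute("AVAILABILITY", "Availability", "Boolean"),
]


def is_valid(attribute: Attribute, value: str) -> bool:
    """Return True if the raw string value satisfies the attribute definition."""
    if value == "":
        return attribute.allow_empty
    if attribute.data_type == "Decimal":
        try:
            Decimal(value)
        except InvalidOperation:
            return False
        return True
    if attribute.data_type == "Boolean":
        return value in ("true", "false")
    return True  # Text accepts any non-empty string


def invalid_fields(record: dict) -> list:
    """List identifiers whose values break the model (a missing key counts as empty)."""
    return [a.identifier for a in ATTRIBUTES
            if not is_valid(a, record.get(a.identifier, ""))]
```

For example, invalid_fields({"PRICE": "twelve", "AVAILABILITY": "yes"}) reports both PRICE and AVAILABILITY, while the missing SUBCATEGORY passes because empty values are allowed.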
14. VERIFY THE DATA MODEL STRUCTURE
Confirm that the data model looks like this:
Add a semantic category to the dictionary service
As you may have noticed while creating the data model, many predefined data types are available. In addition to the standard ones,
you can choose predefined semantic categories from the dictionary service.
Now you will add a semantic category to the list of predefined data types. This category is based on a regular expression that validates
the publishing years of library items.
Assume that the publishing year allows:
Years (1 to 4 digits)
An interval of years (1 to 4 digits-1 to 4 digits)
The associated regular expression would be ^([0-9]{1,4})(-[0-9]{1,4})?$
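If you want to check how this expression behaves before entering it in the dictionary service, you can try it outside the product. The following Python sketch uses the pattern exactly as given above; the is_valid_year helper is a hypothetical name used only for illustration, and Python's regular expression engine may differ in small details from the one used by the dictionary service.

```python
import re

# Pattern copied from the text: one year, or two years joined by a hyphen.
YEAR_PATTERN = re.compile(r"^([0-9]{1,4})(-[0-9]{1,4})?$")


def is_valid_year(value: str) -> bool:
    """Return True when value is a 1-4 digit year or an interval of two such years."""
    # fullmatch rejects any trailing characters, including a final newline.
    return YEAR_PATTERN.fullmatch(value) is not None
```

With this helper, "1984" and "1201-1300" are accepted, while "13th century" and "12345" are rejected. Note that the pattern checks only the shape of the value: it does not verify that the first year of an interval comes before the second.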
1. OPEN THE SEMANTIC TYPES MENU
On the menu, click Semantic Types.
d. Verify that the semantic type details appear as in the screenshot, then click the Save and publish button.
4. REORDER ELEMENTS
Move the Year and Language attributes ahead of the Category attribute.
a. You can change the order of fields by dragging and dropping or using the arrows in front of the attribute. The arrows
appear only while hovering your mouse over the attribute.
5. SAVE THE DATA MODEL
To save the data model, in the lower right corner, click the EDIT DATA MODEL button.
Next step
Now you will create a campaign based on the defined data model.
Overview
Consider a scenario in which the company product catalog has product records with some values that need to be
provided or corrected. You must approve product information against a data model to make sure the catalog contains the
most complete and accurate information. Once this is done, the clean data can be uploaded to any target system, for
example, a Talend Master Data Management (MDM) system.
In this section, you will create a resolution campaign to correct library product items.
A campaign contains:
General characteristics
Roles
A data model
A workflow
2. DEFINE THE GENERAL CHARACTERISTICS
Designate the name, description, and type of campaign.
a. Enter a name and description for the campaign:
Name: LIBRARY PRODUCTS RESOLUTION
Description: check the records in the library catalog and cure anomalies
b. For TYPE, click the RESOLUTION button.
The form is updated.
b. Change the role name to DATA ANALYST (uppercase) and save by clicking the check mark.
c. Click in the Add a steward text box and select the data stewards you created earlier, jclark@company.com and
dsmith@company.com.
d. Click ADD A ROLE to add a second role named DATA VALIDATOR (uppercase). To save, click the check mark.
e. Click in the Add a steward text box and select the email address of the business owner, who will act as task validator.
b. The form is updated with the data model (scroll down if necessary). You can now define access rights for the two
roles, which, by default, have editing rights in all fields.
Read-only access is useful if a data steward needs to access information to make a decision but is not allowed to
change the value. For instance, this could happen with foreign keys (unique IDs of other elements linked to the
entity that the steward is viewing). Or you may have data that must not be changed (for instance, the customer
support level).
Hiding an attribute is useful if the information is sensitive (such as financial information) and should not be seen by
the data steward, or if it is irrelevant to the data steward (for example, ID numbers) but still needs to be propagated
as part of the task.
Restrict access as follows:
For a data analyst, allow read-only access to the Code category
For a data analyst, hide the price
You must also define which data stewards can access which task state or transition as part of the campaign workflow.
a. To add a validation step to the workflow necessary to resolve the tasks in the campaign, click the switch next to
Enable validation step.
Notice that a step is added to the workflow. This type of workflow is used when there is a need to validate tasks
handled by data stewards.
c. Below NEW, click in the Add a role text box and select DATA ANALYST. This grants the data analyst access to the
campaign tasks in the New state.
d. Below RESOLVED, click in the Add a role text box and select DATA VALIDATOR.
This grants the data validator access to the campaign tasks in the Resolved state.
a. To save the campaign, in the lower right corner, click the ADD CAMPAIGN button.
c. To display the Edit the campaign and Remove the campaign icons, hover over the top-right corner of the LIBRARY
PRODUCTS RESOLUTION section.
Be careful when using the Remove the campaign icon because the deletion of a campaign is irreversible.
Next step
You have almost finished this section. Time for a quick review.
Additional information
Talend knowledge base article documentation:
Managing campaigns
LESSON 4
Creating and Handling Tasks
This chapter discusses:
Concepts
Overview
Creating Tasks
Assigning and Resolving Tasks
Review
Concepts
Objectives
After completing this lesson, you will be able to:
Use Talend Studio to create tasks for a campaign
Use Talend Studio to delete tasks from a campaign
Delegate tasks
Resolve and validate tasks
Next step
You are ready to use Data Integration to create tasks for a campaign.
Creating Tasks
Overview
In this section, you will use Talend Studio to create tasks for the resolution campaign you created.
You will create a new Studio project, create a Job that reads a data file exported from SAP, and create a resolution task
for each record in that file.
You will also learn how to delete all campaign tasks from a Job.
c. Click Finish.
The TalendForge page may appear.
Click Skip this step.
The Integration perspective (which you may remember from the Talend Data Integration Basics course) opens.
Create metadata
1. CREATE FILE DELIMITED METADATA
In the Repository, expand Metadata.
Right-click File delimited and select Create file delimited.
b. Click Next.
In the C:/StudentFiles/TDS folder, select the Library_data.csv file.
For the Format, select WINDOWS.
d. Click Next.
Change the Name to Products_metadata.
For all the fields, change the data type to String and leave Length empty.
e. Click Finish.
The metadata appears in the repository in Metadata>File delimited.
b. Click the Test Connection button and confirm that you get the Successful connection message.
c. Click Finish.
The Job opens in the work area.
c. Add a tMap component and connect it to the input component using a Main row.
d. To the right of the tMap component, add a tDataStewardshipTaskOutput component.
f. Click OK.
c. For the data stewardship metadata, select TDS_Connection and click OK.
The URL, Username, and Password boxes are filled in.
d. Click the Find a campaign button.
Select Library Products Resolution and click OK.
f. Make sure the Override enforcement of data model check box is selected.
Assume that some of the input data does not respect all constraints from the defined data model. In this case, if this
parameter is not selected, the tasks are not created, as they will be in error. You want to create tasks for all records
b. The schema was populated for the output component only.
c. To populate the createTasks schema with the same fields, click the double left arrow button.
When asked if you want to transfer all columns to the input schema, click OK.
c. Manually map the following fields:
Author>AUTHOR_DIRECTOR
Type>SUBCATEGORY
Ranking>POPULARITY
d. Click OK.
7. SAVE THE JOB
To save the Job, press Ctrl+S.
8. TEST THE JOB
To create resolution tasks, execute the Job.
b. Click Log In.
Click Campaigns.
Confirm that 1,698 tasks were created and zero are unassigned.
While configuring the Job that creates tasks for a campaign, you may make an error (for example, you may not correctly designate
the assignee), and you will only notice this when viewing generated tasks.
This is why it is important to be able to delete all tasks and re-create them, if needed.
In the next section, you will create a Job that deletes all tasks, and then, to re-create them, you will again execute the
CreateResolutionTasks Job.
b. The New Job window opens.
In the Name text box, enter DeleteTasks
In the Purpose text box, enter Delete curation tasks for library items
b. Add a tLogRow component and connect it to the input component using a Main row.
4. CONFIGURE THE INPUT COMPONENT
Configure the tDataStewardshipTaskDelete component to delete all previously created resolution tasks.
a. To open the Component tab, double-click the tDataStewardshipTaskDelete component.
In the Run view, click the Run button.
After executing this Job, all tasks should disappear from the Library Products Resolution campaign.
7. RE-CREATE TASKS
In Talend Studio, open the CreateResolutionTasks Job.
a. Run the Job.
b. Confirm that, as before, 1,698 rows were computed (you did not alter this Job).
Next step
Now you will act as data steward and learn how to assign and resolve resolution tasks using the Data Stewardship UI.
Overview
Until now, you played the role of campaign owner. You created a data model, a resolution campaign, and tasks for that
campaign.
In this section, acting as a data steward, you will delegate tasks to another data steward and resolve your assigned tasks.
To resolve your tasks, you will analyze inconsistencies and use available functions to simplify your work and correct data
from several tasks at once.
When you realize task definitions are not correct or complete, you will update a constraint defined in the data model and
the definition of a semantic type in the dictionary service.
Finally, in order to understand the workflow configured when you defined the campaign, you will validate the corrected
tasks and follow task evolution from one state to another.
assigned tasks.
3. REORDER TASKS
4. DELEGATE TASKS
Delegate the first 10 tasks to Dave Smith.
a. To open the delegate tasks menu, click the arrow on the left.
Confirm that Dave Smith has no assigned tasks.
b. On the task list, for the first task, click the Code attribute.
To select the first 10 tasks, press and hold the SHIFT key, then select the tenth task.
c. On the menu to the left, hover your mouse over Dsmith@company.com to bring up a new icon.
b. To the right of the Format column, verify that the Price column does not appear (for data stewards, it is hidden in the
campaign definition).
On the list, select Books.
The list is updated from 83 to 82 tasks because the modified task no longer contains an invalid value for the
Category attribute.
b. On the upper right, on the list of available functions, locate the replace function.
c. Select Replace the Cells that Match and configure the function like this:
Initial value: Game
Replacement value: Video Games
d. Click SUBMIT.
All the Game values are replaced with Video Games, and only two tasks with invalid values remain in the Category
column.
In a real situation you would, of course, finish fixing all the records.
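Conceptually, the Replace the Cells that Match function applies one substitution to every selected task at once. The following Python sketch illustrates the idea only. It assumes an exact, case-sensitive match on the whole cell, which may differ from how the function in the Web UI matches values; the replace_matching_cells helper, the CODE and CATEGORY keys, and the three sample tasks are all hypothetical.

```python
def replace_matching_cells(tasks, column, initial, replacement):
    """Replace cells in `column` that equal `initial`; return how many changed."""
    changed = 0
    for task in tasks:
        if task.get(column) == initial:
            task[column] = replacement
            changed += 1
    return changed


# Hypothetical sample tasks, not taken from the training dataset.
tasks = [
    {"CODE": "1", "CATEGORY": "Game"},
    {"CODE": "2", "CATEGORY": "Books"},
    {"CODE": "3", "CATEGORY": "Game"},
]
count = replace_matching_cells(tasks, "CATEGORY", "Game", "Video Games")
```

In this sample, two tasks are updated and the task whose category is Books is left untouched.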
b. Recall that the YEAR column is a semantic type that should be either a year or an interval of years.
Update the value for the task with Code 153 from 13th century to 1201-1300.
a. In the Web UI, select the LANGUAGE column and click the red area.
b. You want to add Akkadian and Old Norse to the list of allowed values for the language semantic type.
2. OPEN THE LANGUAGE SEMANTIC TYPE
a. Log out and log back in as the campaign owner with these credentials:
Username: bowner@company.com
Password: talend
d. Filter the semantic types by entering language.
b. In the New entry text box, enter Akkadian then validate by clicking the first tick icon (you can also press the enter key).
d. Click the magnifier icon and enter the name of the items you added (they should appear in the Values list).
e. At the bottom of the screen, click the Save and Publish button.
d. Notice that the newly added entries, Akkadian and Old Norse, are now allowed for the Language semantic type (they
are no longer on the list of invalid values). In a real situation you would finish fixing all the invalid values.
e. Remove the filter.
Validate tasks
1. MARK A TASK AS READY FOR VALIDATION
Mark the first task as ready for validation.
a. Select the first task (code 11).
If the lock icon to the left of the row is white on a gray background, the task has no inconsistencies related to the
defined data model.
The task turns green and is marked ready for validation, and the next task is automatically selected.
b. The lock icon is now on a red background. This means the task has inconsistencies related to the defined data model,
so it cannot be marked as ready for validation.
Update the language value to a valid one: English.
NOTE:
The objective of a resolution task is to ensure data integrity and validity. Therefore, you cannot validate a task that has
an invalid value.
3. DELETE A TASK
In your dataset, you have a record with empty name and author fields. Because this record is so low in quality, the data
steward decides to delete it.
a. To select the task with the invalid name, click the red area in the NAME column.
Notice that the name and author fields are empty.
c. Click Remove tasks.
d. Click Submit.
The task is deleted (no task has an invalid value for the NAME column, and the total number of tasks decreases by
one).
5. DISPLAY THE RESOLVED TASKS
Recall from the campaign definition that the data stewards can access tasks in the New state, while the business owner has
access to tasks in the Resolved state.
a. Log out, then log back in as the business owner with these credentials:
Username: bowner@company.com
Password: talend
b. Click the campaign from the Tasks menu.
By default, the resolved and unassigned tasks are displayed.
You can see the two tasks resolved by James Clark.
Next step
You have almost finished this section. Time for a quick review.
You created a second Job to delete all tasks from a campaign (in case you make a mistake and want to re-create all tasks with
different parameters). To delete tasks from a campaign in Data Stewardship, you configured the tDataStewardshipTaskDelete
component.
After creating a set of resolution tasks for a dataset from SAP, you connected to the Data Stewardship Web UI and explored the
generated tasks.
You learned how to handle tasks and delegate them to data stewards.
You saw that you have a couple of data curation options, which can be useful when trying to resolve data inconsistencies.
Finally, you resolved some tasks, validated them, and verified that their states changed according to the two-state workflow defined
for the resolution campaign.
LESSON 5
Grouping Campaign
This chapter discusses:
Concepts
Overview
Creating a Data Model
Creating a Grouping Campaign
Creating Grouping Tasks
Handling Grouping Tasks
Review
Concepts
Objectives
After completing this lesson, you will be able to:
Create a new data model
Define a grouping campaign
Build a Job that creates tasks for the campaign
Bulk assign tasks
Modify task service level agreements (SLA)
Handle tasks
Next step
You are ready to create a new data model.
Creating a Data Model
Overview
You are ready to work on a data model that stores CDs and LPs with details on their format, release year, and information
source.
In this section, you create an associated data model to prepare for the next exercise.
5. ADD AN OPTIONAL ITEM ALBUM ATTRIBUTE
To add a second attribute, on the menu on the left, click the ADD ATTRIBUTE button.
The attribute details form appears in the main frame.
Define it like this:
Identifier: ALBUM
Name: Album
Data type: Text
Make the field optional by leaving the ALLOW EMPTY VALUES slide button activated.
8. ADD AN OPTIONAL ITEM GENRE ATTRIBUTE
Add an attribute for storing the genre of the library item.
Define it like this:
Identifier: GENRE
Name: Genre
Data type: Text
Leave the ALLOW EMPTY VALUES slide button activated.
10. VERIFY THE DATA MODEL STRUCTURE
Confirm that the data model looks like this:
Next step
You are ready to create a campaign based on the defined data model.
Creating a Grouping Campaign
Overview
In this lesson, you create a grouping campaign that enables you to decide which records should be gathered in a common
group.
4. WORK WITH SLA
By default, tasks have an infinite amount of time in which to be resolved. The campaign owner can define a maximum
number of days to resolve tasks. When the time is exceeded, the task will appear in red.
Note: By default, this feature uses a simple day count, so weekends and holidays are counted the same as working
days. However, the SLA can also be defined when generating tasks with a Studio Job, which makes it possible to apply
more advanced calculation rules or external calendars.
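The difference between a simple day count and a working-day rule is easy to see with a small calculation. The Python sketch below is purely illustrative and is not the logic used by Talend Data Stewardship or by any Studio component: the function names, the Monday-to-Friday working week, and the holidays parameter are assumptions chosen for the example.

```python
from datetime import date, timedelta


def simple_due_date(created: date, delay_days: int) -> date:
    """Simple day count: every calendar day counts, including weekends and holidays."""
    return created + timedelta(days=delay_days)


def working_day_due_date(created: date, delay_days: int, holidays=frozenset()) -> date:
    """Alternative rule: count only Monday-Friday days that are not in `holidays`."""
    current = created
    remaining = delay_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current


def is_overdue(due: date, today: date) -> bool:
    """A task is overdue once the current date is past its due date."""
    return today > due
```

For a task created on Friday, January 5, 2024 with a delay of 5 days, the simple count gives Wednesday, January 10, whereas the working-day rule gives Friday, January 12.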
b. Enter 5 in the Delay text box.
f. Click the Add a steward text box and select the business campaign owner, who will be the task validator.
b. To grant the task validator access to the campaign tasks in the Resolved state, below RESOLVED, click the Add a
role text box and select TASK VALIDATOR.
Next step
You are ready to learn how to create grouping tasks in Talend Studio.
Overview
In this exercise, you create a Job that reads an aggregated file from SAP and the legacy product catalog and creates a
grouping task for each record in that file.
This diagram shows how it works:
Create metadata
1. CREATE FILE DELIMITED METADATA
a. In the Repository, expand Metadata.
b. Right-click File delimited and select Create file delimited.
a. In the Name text box, enter Album_Catalog and enter a purpose and description.
b. Click Next.
c. In the C:/StudentFiles/TDS folder, select the albums.csv file.
e. Click Next, select the Set heading row as column names check box, and click Refresh Preview.
f. Click Next.
Change Name to Albums_metadata.
g. Click Finish.
The metadata appears in the repository in Metadata>File delimited.
a. In the Repository, expand the Metadata folder.
Drag the Album_Catalog metadata into the work area. On the list of components, select tFileInputDelimited.
b. Click OK.
The component is added to the Job.
d. Drag the TDS_Connection metadata to the right of the tMap component in the work area. Then select
tDataStewardshipTaskOutput.
e. Click OK.
f. To open the Component tab, double-click the tDataStewardshipTaskOutput component.
g. Click the Find a campaign button, select LIBRARY PRODUCTS GROUPING, and click OK.
k. Click OK.
l. Click Yes when asked if you want to get the schema of the target component.
c. Manually map the following fields:
GID>TDS_GID
d. Click OK.
5. SAVE THE JOB
To save the Job, press Ctrl+S.
6. TEST THE JOB
To create grouping tasks, execute the Job.
a. In the Run view, click the Run button.
Next step
As a campaign owner, you are ready to assign tasks, act as a data steward, and handle grouping tasks.
Overview
You created a grouping campaign and generated tasks for it. Now you can act as a campaign owner, dispatching tasks to
data stewards. Once done, you will also act as a data steward, handling some arbitration tasks. Unlike in the resolution
campaign, handling tasks in a grouping campaign consists not of correcting records but of answering a question about a
group of records.
To display the tasks you created earlier, click LIBRARY PRODUCTS GROUPING.
a. Select the first half of the tasks: click the first one, press and hold SHIFT, and select task 51.
d. Update the Due date text box by adding two days to the current date, then click the SUBMIT button.
e. Verify that the due date has been updated in the DUE DATE column of the list for only the first 51 tasks.
b. Click the [+] to the right of Assign tasks to expand the menu.
c. Select All tasks on the Selection drop-down menu and click the SUBMIT button.
8. MARK A TASK AS READY FOR VALIDATION
The task is green and ready for validation, and the choice is visible from the Arbitration column.
9. BULK RESOLVE TASKS USING THE MENU ON THE RIGHT
c. Click the [+] to the right of Arbitrate tasks to expand the menu.
d. Select Not sure on the Arbitration drop-down menu and click the SUBMIT button.
Next step
You have almost finished this section. Time for a quick review.
Review
In this lesson, you created a grouping campaign and learned how to handle a scenario in which data stewards must validate the
correlation of task records by answering a question.
You learned how to:
Create a grouping campaign
Create and assign grouping tasks in Talend Studio
Define task SLA
Bulk assign tasks to data stewards
Resolve grouping tasks in Data Stewardship
Additional information
Talend knowledge base:
Adding a grouping campaign
Handling grouping tasks
Concepts
Overview
Creating a Merging Campaign
Creating Merging Tasks
Resolving Merging Tasks
Querying Tasks
Review
Concepts
LESSON 6 | 195
196 | Talend Data Stewardship - Participant Guide
Matching functions
Soundex: phonetic algorithm (English) that indexes
strings with a code
For example: “Hello”→“H400”
Advice: Soundex yields many false positives; Double
Metaphone is usually a better choice
Metaphone: Phonetic algorithm that encodes strings
Double Metaphone: New version of Metaphone
Levenshtein (edit distance): Computes minimum number
of edits (insertions, deletions, substitutions)
needed to transform one string into the other
Jaro: Counts number of matched characters and
transpositions
Jaro-Winkler: same as Jaro but with more importance
given to beginning of word
Best suited for short strings such as names of people
Q-grams: slides a window of length q over the string
and compares number of matching q-grams over total
number of q-grams
Trigrams (q=3) are currently implemented
For example: « match » → « **m », « *ma », « mat », «
atc », « tch », « ch* », « h** »
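To see how some of these matching functions work, here is a minimal Python sketch of Soundex, the Levenshtein distance, and padded q-grams. It is meant only to illustrate the ideas listed above and is not the implementation used by Talend components. The function names are hypothetical, soundex assumes a non-empty word made of unaccented letters, and the normalization in qgram_similarity (shared q-grams divided by the total number of q-grams in both strings) is one possible reading of the description.

```python
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word: str) -> str:
    """Classic four-character Soundex code (first letter plus three digits)."""
    word = word.upper()
    result = word[0]
    previous = SOUNDEX_CODES.get(word[0], "")
    for char in word[1:]:
        if char in "HW":
            continue  # H and W do not separate letters that share a code
        code = SOUNDEX_CODES.get(char, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset the previous code
    return (result + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def qgrams(text: str, q: int = 3, pad: str = "*") -> list:
    """Slide a window of length q over text padded with q-1 pad characters per side."""
    padded = pad * (q - 1) + text + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    """Share of q-grams that the two strings have in common (assumed normalization)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    remaining = list(grams_b)
    matches = 0
    for gram in grams_a:
        if gram in remaining:
            remaining.remove(gram)
            matches += 1
    return 2 * matches / (len(grams_a) + len(grams_b))
```

With these definitions, soundex("Hello") returns "H400" and qgrams("match") returns the seven trigrams shown in the example above.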
Overview
In the previous lessons, you created a resolution and a grouping campaign.
In this lesson, you will create a merging campaign, generate tasks for it, and resolve and validate the tasks by implementing a three-
state validation workflow.
Imagine that a company needs to merge data from two systems: SAP and the legacy product catalog. The merging campaign lets
you choose fields to determine a match between records, and then merge the records. Once data is deduplicated, the unique list of
product items can be uploaded to any product referential system.
In this lesson, you receive a dataset containing library products from two sources, which contain unique products as well as
duplicates. Some are clearly duplicates, but others are just "suspects" that a data steward needs to confirm as duplicates.
To facilitate extraction of a set of unique library products from this dataset, you will:
Create a merging campaign in Data Stewardship
Create a Job that computes the input dataset, creates a list of unique items, creates a list of definite duplicates, and
generates tasks for suspects
Resolve and validate merging tasks
Build a Job that consumes tasks validated in Data Stewardship, in order to retrieve the corresponding golden records
(unique records calculated from the merge of two suspect records)
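The split between unique items, definite duplicates, and suspects can be pictured as a routing decision based on a match score. The Python sketch below only illustrates that idea. The two thresholds (0.9 and 0.7), the classify and route function names, and the sample scores are all hypothetical, and the Job you build in this lesson configures its matching rules in Talend Studio rather than in code like this.

```python
def classify(score: float,
             duplicate_threshold: float = 0.9,
             suspect_threshold: float = 0.7) -> str:
    """Map a match score between 0 and 1 to one of the three outputs."""
    if score >= duplicate_threshold:
        return "duplicate"  # clear match, no human decision needed
    if score >= suspect_threshold:
        return "suspect"    # uncertain match, becomes a merging task
    return "unique"         # no convincing match found


def route(scored_records):
    """Split (record, best_score) pairs into the three outputs described above."""
    outputs = {"unique": [], "duplicate": [], "suspect": []}
    for record, score in scored_records:
        outputs[classify(score)].append(record)
    return outputs
```

For example, route([("A", 0.95), ("B", 0.75), ("C", 0.2)]) places A with the duplicates, B with the suspects, and C with the unique items. Only the suspects would be sent to data stewards as merging tasks.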
Objectives
After completing this lesson, you will be able to:
Define a merging campaign
Build a Job that implements record matching and generates tasks for suspects
Resolve and validate merging tasks
Build a Job that consumes resolved tasks
Next step
You are ready to define a merging campaign.
Creating a Merging Campaign
Overview
In this lesson, you will create a merging campaign that enables you to decide which fields to use to determine a match
between suspect records, and then merge records to create a golden record.
4. DEFINE THE CAMPAIGN OWNERS
By default, the campaign owner creating the campaign is assigned to it.
a. Add a second, technical campaign owner; this role will be used for adding tasks to the campaign using Talend Studio.
b. Click in the Add a campaign owner text box and select towner@company.com.
e. Click the Add a steward text box and select the business campaign owner, who will be the task validator.
a. Click in the Select a data model text box and select Library Products.
b. To grant the data steward access to the campaign tasks in the New state, below NEW, click in the Add a role box and
select DATA STEWARD.
d. To grant the task validator access to the campaign tasks in the Resolved state, below RESOLVED, click in the Add a
role box and select TASK VALIDATOR.
a. In the lower right corner, click the ADD CAMPAIGN button.
Next step
Now you will learn how to create merging tasks in Talend Studio.
Overview
In this section, you will use Talend Studio to generate tasks for the merging campaign you created.
You will create a Job that:
Reads a file containing product items exported from SAP and the legacy product catalog
Defines matching rules that will be applied to match records and group the records by similarity
Creates a list of unique items and a list of clear duplicates, then generates tasks only for suspect duplicates
This diagram shows how it works:
c. In the Input new name box, enter CreateMergingTasks.
c. Change the file name to C:/StudentFiles/TDS/Library_data_Sources.csv.
This file contains product items from both SAP and the legacy product catalog.
d. Click the ellipsis [...] button next to Edit schema, and to transfer the schema to the mapping component in the
schema window, click the double left arrow button.
When asked for validation, click OK.
h. To save the updated Job, press Ctrl+S.
4. ADD A MATCHING COMPONENT
To group similar records, you need to add a tMatchGroup component.
a. Between the input and mapping components, delete the Main row.
NOTE:
Two matching algorithms are available:
Simple VSR and T-Swoosh.
T-Swoosh is a sequential entity resolution algorithm that allows you to match and preserve records
sequentially. The algorithm calculates a golden record for each group of records by using the survivorship
rules defined in the configuration of this component.
The Simple VSR algorithm takes a set of records as input and groups the duplicates it encounters according
to the defined match rules. It compares pairs of records and assigns them to groups. The first record
processed in each group is the master (no survivorship rules are defined for this algorithm).
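The grouping behavior described for Simple VSR (compare pairs, assign to groups, first record becomes the master) can be sketched as follows. This is a simplified illustration, not the tMatchGroup implementation: the function names are hypothetical, `difflib.SequenceMatcher` stands in for the configured match rules, and the 0.8 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def is_match(a, b, threshold=0.8):
    """Stand-in match rule: similarity ratio of the lowercased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def group_records(names, threshold=0.8):
    """Group similar names; the first record of each group acts as its master."""
    groups = []
    for name in names:
        for group in groups:
            # Compare the incoming record with the master of each group
            if is_match(group[0], name, threshold):
                group.append(name)
                break
        else:
            # No group matched: this record starts a new group as its master
            groups.append([name])
    return groups


groups = group_records(["Shrek", "Alien", "Shrekk", "Aliens"])
print(groups)
```

Because the master is simply the first record encountered, no attribute values are merged here, which is why a separate step is needed to build a golden record.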
At this stage, you want to group only similar records without calculating the golden record. You want to let Data
Stewardship calculate the golden record according to the survivorship rules defined when the merging campaign was
created. As a result, in this exercise you will configure the Simple VSR algorithm.
On the Blocking Selection panel, click the plus symbol [ + ] and select the Category column.
The goal of the blocking column is to create groups of records that limit the number of comparisons during
probabilistic matching.
In this case, you saw that the Category attribute contains very clean data and allows you to create smaller groups for
the matching computation.
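To see why blocking limits the number of comparisons, consider that n records require n(n-1)/2 pairwise comparisons, whereas with blocking only records that share a blocking key are compared. The sketch below counts both cases; the function names and the sample category values are hypothetical.

```python
from collections import Counter


def pair_count(n):
    """Number of distinct pairs among n records."""
    return n * (n - 1) // 2


def comparisons_with_blocking(block_keys):
    """Sum the pair counts inside each block (records sharing a key)."""
    return sum(pair_count(size) for size in Counter(block_keys).values())


# Hypothetical sample: 10 records spread over three categories
categories = ["Movies"] * 4 + ["Books"] * 3 + ["Music"] * 3

print(pair_count(len(categories)))            # comparisons without blocking
print(comparisons_with_blocking(categories))  # comparisons with blocking
```

The saving grows with the number of records, but it relies on the blocking column being clean: two true duplicates with different blocking values are never compared.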
e. Click OK.
f. To save the Job, press Ctrl+S.
7. DEFINE A SEPARATE OUTPUT FOR THE MATCHING COMPONENT
a. On the Component tab, click Advanced settings.
b. The Job looks like this:
b. To map the input fields to the createTasks output table, replace row1 with row2. Then map the matching-specific
fields:
TDS_GID: GID
TDS_MASTER: false
NOTE: The TDS_MASTER field must be set to false because no master record is generated for suspects. Recall that
when using the Simple VSR algorithm, no golden record is calculated. Data Stewardship takes care of creating it by
following the survivorship rules defined in the merging campaign.
b. Connect it to the tMatchGroup component using a Uniques row.
Because the tMatchGroup component does not calculate a golden record for confident matches (remember that you used
the Simple VSR matching algorithm), you first need to calculate a golden record for the groups of confident matches. Then you will
write these golden records to an output file.
1. ADD A SURVIVE FIELDS COMPONENT
The tSurviveFields component receives a flow and merges it based on one or more columns.
a. Below the tMap component, add a tSurviveFields component.
b. Connect it to the tMatchGroup component using a matches row.
b. Connect it to the tSurviveFields component using a Main row and answer yes when asked if you want to get the
schema of the target component.
d. Define the survivorship function for each output attribute.
On the Operations panel, click the double plus symbol.
e. In the Set Column Default Value wizard, select first as the default value for Function.
f. Click OK.
The Operations section is updated. The survivorship function for all attributes is designated as first. You can define the
survivorship function differently for each attribute, but because the confident score was high, you can assume that the
g. To remove the matching-specific attributes, select them all and click the cross [ X ] button.
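The effect of applying the first function to every attribute can be sketched as follows. This is an illustration of the idea, not the tSurviveFields implementation: the function names, the Title and Price columns, and the sample values are hypothetical; only the GID grouping column comes from this exercise.

```python
def first(values):
    """Survivorship function: keep the first value seen in the group."""
    return values[0]


def survive_fields(rows, key, operations):
    """Merge rows that share the same key into one row per group.

    operations maps each output column to the function that picks
    the surviving value from the group's values for that column.
    """
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row)

    merged_rows = []
    for group_id, members in grouped.items():
        merged = {key: group_id}
        for column, func in operations.items():
            merged[column] = func([member[column] for member in members])
        merged_rows.append(merged)
    return merged_rows


# Hypothetical confident matches, already grouped by GID
rows = [
    {"GID": "g1", "Title": "Shrek", "Price": 9.99},
    {"GID": "g1", "Title": "Shrek", "Price": 10.49},
    {"GID": "g2", "Title": "Alien", "Price": 7.50},
]
golden = survive_fields(rows, "GID", {"Title": first, "Price": first})
print(golden)
```

Passing a different function for a given column (for example `max` for a price) is how a per-attribute survivorship rule would be expressed in this sketch.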
c. Verify the Include Header check box is enabled.
In the same folder, use Notepad++ to open the matches.csv file.
The file contains 1,465 unique records (computed from 2,952 confident matches).
3. CONNECT TO DATA STEWARDSHIP
Connect as a business owner.
a. Enter these credentials:
Username: bowner@company.com
Password: talend
b. Click Log In.
The Campaigns page opens.
Confirm that 20 tasks (computed from 43 suspect records) were generated for the campaign.
c. Log out.
Next step
In the next section, you will resolve merging tasks.
Overview
Until this point in the merging campaign, you have played the role of campaign owner. You created the campaign and
generated tasks. You did not assign tasks to a data steward; they are all unassigned.
In this section, you will play the role of data steward, choosing and resolving unassigned tasks. You will also go through a
three-state validation workflow to understand the complete path followed by tasks.
b. On the top menu, to open the drop-down menu, click Assigned to me.
c. To see the tasks, click Unassigned.
d. To sort the tasks in ascending order, click the arrows in the Code column.
c. Notice that only 10 tasks are unassigned, and 10 are assigned to the logged-in data steward.
4. LOG OUT
In the upper right corner, click Logout.
a. Near the third task (the code is 950), click the arrow. The task details open and you can see the records that were
grouped as potential duplicates, as well as the golden record that Data Stewardship computed by following the
survivorship rules established when you defined the campaign.
b. These two records definitely look like duplicates; one has the movie name misspelled.
Verify that the survivorship rules were applied when building the golden record:
The Popularity and Availability attributes were retained from the SAP source, while the price was retained from the
legacy product catalog source.
4. MERGE TASKS
c. On the right side of the interface, click the Row tab, and on the menu, choose Merge the selected tasks.
d. The merged task looks like this:
b. On the third record, click the Author column and blue arrow to define the value as the author of the movie in the
golden record.
c. Use the same process to move the Year value from the second record to the golden record.
d. Use the same process to move the Sub-Category value from the fourth record to the golden record.
e. Click the white lock icon to mark the task as ready for validation.
6. SPLIT TASKS
Notice that task 5 (code 1322) is a false duplicate. This means the task needs to be split.
d. Open each task and click its lock to mark it as ready for validation.
Validate tasks
1. VALIDATE THE CHOICES
You marked four tasks as ready for validation, but they are not validated.
a. In the upper right corner, click the VALIDATE CHOICES button.
b. Notice that the number of tasks for James Clark has decreased from 10 to 6. This means the validated tasks have
passed from the New state to the To validate state (as defined in the three-state workflow configured for the merging
campaign).
a. Click the first column header.
4. REJECT ONE TASK
Although both records were identical for this movie, the reference from SAP was actually an extended version.
Therefore, the campaign owner decided to reject it.
a. By default, task 1 is selected, and you can choose to either accept or reject the work of the data steward on this task.
Click the REJECT button.
NOTE:
After validating a rejected task, the task goes back to the Unassigned state.
a. Hover your mouse over the Shrek task and click the clock icon.
View the details and verify that the merge action appears on the list.
b. The task is marked as ready for validation, the transition value is designated as ACCEPT, and the next task is opened.
c. Use the same process to accept changes for the other two tasks.
After validating the tasks, the accepted tasks go to the Resolved state.
7. VALIDATE CHOICES
You have marked four tasks as ready for validation, but they are not complete.
When you validate tasks, the rejected ones go back to the Unassigned state while the accepted ones go to the Resolved
state.
In the upper right corner, click VALIDATE CHOICES.
b. Click Resolved.
c. The three accepted tasks are displayed.
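The path a task follows through the three-state validation workflow can be summarized as a small state machine. This is a sketch for illustration only, not Data Stewardship code: the class and function names are hypothetical, and a rejected task is modeled as returning to the New state with no assignee, which is an assumption based on the behavior observed in this exercise.

```python
from dataclasses import dataclass
from typing import Optional

NEW, TO_VALIDATE, RESOLVED = "New", "To validate", "Resolved"


@dataclass
class Task:
    code: int
    state: str = NEW
    assignee: Optional[str] = None


def validate_choices(task):
    """Data steward submits a task marked as ready: New -> To validate."""
    if task.state != NEW:
        raise ValueError(f"task {task.code} is not in the New state")
    task.state = TO_VALIDATE


def review(task, decision):
    """Task validator accepts or rejects a task in the To validate state."""
    if task.state != TO_VALIDATE:
        raise ValueError(f"task {task.code} is not awaiting validation")
    if decision == "ACCEPT":
        task.state = RESOLVED
    elif decision == "REJECT":
        # Assumption: a rejected task becomes unassigned and new again
        task.state = NEW
        task.assignee = None
    else:
        raise ValueError(f"unknown decision: {decision}")


accepted = Task(code=950, assignee="James Clark")
rejected = Task(code=1322, assignee="James Clark")
for task in (accepted, rejected):
    validate_choices(task)
review(accepted, "ACCEPT")
review(rejected, "REJECT")
print(accepted, rejected)
```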
a. On the menu at the top, click the arrow next to Tasks.
b. Click Campaigns.
By default, new tasks are displayed.
Confirm that the rejected task appears on the list of tasks in the New state.
Next step
Now you will query resolved tasks in order to retrieve golden records validated in Data Stewardship.
Overview
In this section, you will use Talend Studio to query resolved tasks from the merging campaign. You will create a Job that
extracts golden records from the resolved tasks validated in Talend Data Stewardship and writes them to an output file.
This diagram shows how it works:
c. Click Finish.
The Job opens in the work area.
a. Click the tDataStewardshipTaskInput component to open the Component tab.
g. Notice that the Retrieve golden record only check box is selected.
b. To prevent tasks from persisting after they are read, select the Consume tasks when they are read check box.
c. Click OK and map the first 11 fields to the output table.
Press SHIFT and select all 11 attributes at once.
c. Select the Include Header check box.
b. Confirm that three rows from the merging campaign were read.
These correspond to the three tasks that were in the Resolved state.
b. Right-click the suspects_resolved.csv file and select Edit with Notepad++.
2. EXPLORE THE OUTPUT FILE
The output file is open. Notice that three rows correspond to the three golden records you validated in Data Stewardship.
In this lesson, you wrote three types of output from the tMatchGroup component—unique, matching, and resolved suspect—to three
output files. In an actual project, you would write these records to a product referential database.
Next step
You have almost finished this section. Time for a quick review.
Additional information
Talend documentation:
Data matching with Talend tools
Adding a merging campaign
LESSON 7
Arbitration Campaigns
This chapter discusses:
Concepts
Overview
Creating an Arbitration Campaign
Creating Arbitration Tasks
Handling Arbitration Tasks
Review
Concepts
Objectives
After completing this lesson, you will be able to:
Define an arbitration campaign
Build a Job that creates tasks for the campaign
Assign tasks
Handle tasks
Reopen resolved tasks
Next step
You are ready to define an arbitration campaign.
Creating an Arbitration Campaign
Overview
In this lesson, you will create an arbitration campaign that allows data stewards to specify the right discount for each
product.
Click in the Add a campaign owner box and select towner@company.com.
b. Click the Add a steward box and select James Clark and Dave Smith.
b. To grant data stewards access to tasks in the New state, below New, click in the Add a role box and select
DATA STEWARD.
c. To grant data stewards access to tasks in the Resolved state, below Resolved, click in the Add a role box and select
DATA STEWARD.
Next step
In the next section, you will create arbitration tasks in Talend Studio.
Creating Arbitration Tasks
Overview
In this section, you will use Talend Studio to generate tasks for the arbitration campaign you created.
You will create a Job that reads a file containing library product data, and create an arbitration task for each record in that
file.
When data stewards need to handle a high volume of tasks, it is important to define task priorities. You will now assign
tasks to data stewards based on item categories: movie-related tasks go to James Clark, while the rest go to Dave Smith.
This diagram shows how it works:
c. Click OK.
Double-click the Job to open it in the work area.
d. Select the Override enforcement of data model check box.
c. Click OK. When asked if you want the schema of the target component, click Yes.
a. To open the Component tab, double-click the tDataStewardshipTaskOutput_2 component.
b. Click the Enable/disable expression filter icon (the arrow with a plus sign).
c. To leave room for the second output table, in the first output table, click the Minimize icon.
a. Select the second output table and click Auto Map!.
2. EXPLORE THE RESULTS
Notice that 1,262 movie-related tasks were assigned to James Clark, while the rest (436) were assigned to Dave Smith.
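The category-based assignment performed by the two filtered outputs can be sketched as a simple routing rule. This is an illustration only: the function name, the "Movies" category label, and the sample records are hypothetical, and the real Job expresses the rule as expression filters in the mapping component rather than in code like this.

```python
from collections import Counter


def assignee_for(category):
    """Route movie-related tasks to James Clark and the rest to Dave Smith."""
    return "James Clark" if category == "Movies" else "Dave Smith"


# Hypothetical sample records; the training file is much larger
records = [
    {"Category": "Movies"},
    {"Category": "Books"},
    {"Category": "Movies"},
    {"Category": "Music"},
]

# Count how many tasks each data steward would receive
assignments = Counter(assignee_for(record["Category"]) for record in records)
print(assignments)
```

Because the second branch is a catch-all, every record produces exactly one task, so the two counts always add up to the number of input records.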
As the business owner, connect to Data Stewardship and from the Campaigns menu, click Library Products Arbitration.
1. FILTER RECORDS
a. Logged in as the business owner, to display tasks assigned to James Clark, on the Assign tasks to: menu, click
Jclark@company.com.
c. To display only movies available in DVD format, apply a second filter.
Select the Format column, and in the chart on the right, select DVD.
2. DEFINE PRIORITY
a. On the menu on the right, click the Row tab.
d. Click Submit.
Verify the priority for the filtered records.
e. Log out.
Next step
In the next section, you will handle some arbitration tasks.
Handling Arbitration Tasks
Overview
You created an arbitration campaign and generated tasks for it. Now you will act as a data steward and handle some
arbitration tasks. Finally, you will explore resolved tasks and learn how to reopen one. This functionality would be helpful, for
example, in a scenario in which the discount rate is wrong and the task must be reviewed.
b. To display only books, apply a second filter.
Select the CATEGORY column, and on the right, select Books.
b. Confirm that you selected 11 items.
To apply the discount to the selected items, click 5%.
c. The ARBITRATION values are designated as 5%, and the tasks are green and ready for validation.
6. VALIDATE THE TASKS
Validate the tasks.
b. Notice that the number of tasks for Dave Smith has decreased from 436 to 424.
This means the 12 tasks moved from the New to the Resolved state.
a. On the top menu, click the arrow next to New and select Resolved.
b. There are 12 tasks: one shows the arbitration value at 10% and the others show it at 5%.
8. REOPEN A RESOLVED TASK
Examining the resolved tasks, the data steward realizes that the arbitration value for the first task is incorrect. He would like
to reopen and update the task.
9. DISPLAY NEW TASKS
The reopened task appears on the list of unassigned tasks in the New state.
a. On the top menu, next to Resolved, click the arrow.
Next step
You have almost finished this section. Time for a quick review.
Additional information
Talend knowledge base:
Adding an arbitration campaign