
Lab Guide

Big Data Advanced - YARN


Version 6.0
Copyright 2015 Talend Inc. All rights reserved.
Information in this document is subject to change without notice. The software described in this document is furnished under a license agreement or nondisclosure agreement. The software may be used or copied only in accordance with the terms of those agreements. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or any means, electronic or mechanical, including photocopying and recording, for any purpose other than the purchaser's personal use, without the written permission of Talend Inc.
Talend Inc.
800 Bridge Parkway, Suite 200
Redwood City, CA 94065
United States
+1 (650) 539 3200
Welcome to Talend Training
Congratulations on choosing a Talend training module. Take a minute to review the following points to help you get the most from
your experience.

Technical Difficulty

Instructor-Led
If you are following an instructor-led training (ILT) module, there will be periods for questions at regular intervals. However, if you need an answer in order to proceed with a particular lab, or if you encounter a situation with the software that prevents you from proceeding, don't hesitate to ask the instructor for assistance so it can be resolved quickly.

Self-Paced
If you are following a self-paced, on-demand training (ODT) module, and you need an answer in order to proceed with a particular
lab, or you encounter a situation with the software that prevents you from proceeding with the training module, a Talend professional
consultant can provide assistance. Double-click the Live Expert icon on your desktop to go to the Talend Live Support login page
(you will find your login and password in your ODT confirmation email). The consultant will be able to see your screen and chat with
you to determine your issue and help you on your way. Please be considerate of other students and only use this assistance if you are
having difficulty with the training experience, not for general questions.

Exploring
Remember that you are interacting with an actual copy of the Talend software, not a simulation. Because of this, you may be tempted to perform tasks beyond the scope of the training module. Be aware that doing so can quickly derail your learning experience, leaving your project in a state that is not readily usable within the tutorial, or consuming your limited lab time before you have a chance to finish. For the best experience, stick to the tutorial steps! If you want to explore, feel free to do so with any time remaining after you've finished the tutorial (but note that you cannot receive consultant assistance during such exploration).

Additional Resources
After completing this module, you may want to refer to the following additional resources to further clarify your understanding and
refine and build upon the skills you have acquired:
Talend product documentation (help.talend.com)
Talend Forum (talendforge.org/)
Documentation for the underlying technologies that Talend uses (such as Apache) and third-party applications that complement Talend products (such as MySQL Workbench)



CONTENTS
LESSON 1 Clickstream Use Case
Clickstream use case
Overview
Set up development environment
Overview
Hadoop cluster connection
Context variables
Load data files into HDFS
Overview
Delete existing files
Load Data into HDFS
Run job
Retrieve schema
Enrich logs
Overview
Read files from HDFS
Mapping weblogs and products
Mapping with users
Store results in HDFS
Run Job and check results
Compute statistics
Overview
Read enriched logs
Filter data
Compute statistics
Run job and check results
Convert to Big Data Batch Job
Overview
Convert to Big Data Batch Job
Run job and check results
Understanding MapReduce jobs
Overview
Read enriched logs
Aggregate data
Filter data
Run job and check results
Understanding MapReduce Jobs
Challenges
Overview
Challenge
Solutions
Overview
How many men/women from NY state?
What did they buy?
Most/least interesting categories?
Wrap-Up
Recap
Further Reading

LESSON 2 Sentiment Analysis Use Case

Sentiment analysis
Overview
Load Lookup Data Into HDFS
Overview
Clean-up HDFS
Load Dictionary data into HDFS
Load Time zone data into HDFS
Run Job and check results
Load tweets into HDFS
Overview
Query Twitter
Extract useful information
Standardizing text
Combine tweets and dictionary
Aggregate and load tweets into HDFS
Context variable creation
Run job and check results
Process tweets with MapReduce
Overview
Collect Tweets and Time zone data from HDFS
Data mapping
Aggregate data
Load results into HDFS
Results visualization
Challenge
Overview
Exercise
Solutions
Overview
Solution
Wrap-Up
Recap
Further Reading



LESSON 1
Clickstream Use Case
This chapter discusses the following.

Clickstream use case
Set up development environment
Load data files into HDFS
Enrich logs
Compute statistics
Convert to Big Data Batch Job
Understanding MapReduce jobs
Challenges
Solutions
Wrap-Up
Clickstream use case

Overview
The Clickstream example was originally developed by Hortonworks, and Talend later ran a joint session with Hortonworks to show users how the same result can be achieved without hand coding.

Clickstream data provides insights to companies on how users are browsing their product web pages and what flow they go through to get to the end product.
Omniture is one of the companies that provide this type of data. In this Clickstream use case, you will load data files into HDFS and then use a Talend Big Data Batch job to enrich the data and calculate different results that will be stored in a file for further investigation.

Objectives
After completing this lab, you will be able to:
Create Metadata for your Hadoop cluster connection
Configure context variables
Load multiple files into HDFS
Check data with the Data viewer
Retrieve the schema of a file, using the Talend wizard
Convert a Standard Job to a Big Data Batch Job

Before you begin


Be sure that you have completed the Big Data Basics course, or are comfortable with the topics covered there.
Be sure that you are working in an environment that contains the following:
A properly installed Talend Big Data Platform
The supporting files for this lab
A properly configured Hadoop cluster
Everything has already been set up for you in your training environment.



The first step is to set up your development context.

Set up development environment

Overview
In this section, you will define metadata to set up your environment. That means centralizing information that you would otherwise have to define in multiple components, such as your Hadoop cluster connection information, or the local and HDFS folders that you will use to read and write data.
First, you will create metadata for the Hadoop cluster connection.

Hadoop cluster connection


1. Start the Talend Studio and open the BDAdvanced_YARN Project.
2. In the Repository, under Metadata, right-click Hadoop Cluster and then click Create Hadoop Cluster.
3. Name your new cluster connection TrainingCluster, in the Name box and then click Next.
4. In the Distribution list, select Cloudera and in the Version list, select Cloudera CDH5.4(YARN mode).
5. Select Retrieve configuration from Ambari or Cloudera and then, click Next.
6. In the Manager URI (with port) box, enter "http://ClusterCDH54:7180".
7. In the Username and Password boxes, enter "admin".
8. Click Connect. This will list all the clusters administered by the Cloudera Manager:



9. Click Fetch. The wizard will retrieve the configurations files of all running services in your cluster:

10. Click Finish.


11. Enter "student" (without quotes) in the User name box:

12. Click Check Services to validate your configuration.



If the test fails, check your configuration and try again.
13. Click Close to exit the Check Hadoop Services window.
14. Click Finish to save.
The TrainingCluster metadata automatically appears in the Repository.
15. Expand TrainingCluster under Metadata/HadoopCluster:

The wizard created the cluster metadata and it also created metadata for various services such as HDFS, HBase, Hive and
Oozie.
16. Click HDFS(1) and double-click TrainingCluster_HDFS:

17. In the Name box, enter HDFS_connection.


18. Click Next.

19. Enter student in the User name box and select Tabulation in the Field Separator list.
Your configuration should be as follows:

20. Click the Check button to check your connection. You should see the following message :

21. Click OK to close the message window and then click Finish to save your HDFS connection.
Now, you should see HDFS_connection under Hadoop Cluster/TrainingCluster/HDFS in Metadata.

Context variables
Sometimes it is necessary to provide the same information again and again in multiple components, file folders for example. If you ever had to change the folder path, you would have to change it manually in every component that uses this folder. To avoid this painful task, you will create context variables. This way, you only have to change the context variable's value.
In the next job, you will create two context variables: one for the local directory, and another one for the HDFS directory. You will need those directories to read and write data.
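Note: to make the idea concrete, here is a minimal plain-Java sketch of what a context group amounts to: a set of named values that every expression can reference, so changing a value in one place updates every component that uses it. The class and field layout below is purely illustrative and is not Talend's actual generated code.

    // Illustration only: how expressions such as context.HDFS_ROOT_DIRECTORY + "input"
    // resolve once the context group is loaded. The class name is hypothetical.
    public class ContextIllustration {

        // A context group is essentially a set of named, typed values.
        static class Context {
            String LOCAL_ROOT_DIRECTORY = "C:/StudentFiles/ClickStream/";
            String HDFS_ROOT_DIRECTORY = "/user/student/clickstream/";
        }

        public static void main(String[] args) {
            Context context = new Context();

            // The same expressions you type in the component settings:
            String hdfsInputDir = context.HDFS_ROOT_DIRECTORY + "input";
            String localDir = context.LOCAL_ROOT_DIRECTORY;

            System.out.println(hdfsInputDir); // /user/student/clickstream/input
            System.out.println(localDir);     // C:/StudentFiles/ClickStream/
        }
    }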
1. In the Repository, right-click Contexts and then, click Create context group in the contextual menu.
2. In the Name box, enter "ClickStream_context" (without quotes) and then click Next.
3. Below the table, click the green plus sign to create a new variable.
4. Below Name, click new1 to edit and then enter "LOCAL_ROOT_DIRECTORY" (without quotes).
5. Follow the same procedure to create a context variable called HDFS_ROOT_DIRECTORY.
6. Now you will give values to those context variables.
Click the box below Value in the Default column. This is where you will give a value to LOCAL_ROOT_DIRECTORY.
7. Enter "C:/StudentFiles/ClickStream/" (including the quotes).
8. For HDFS_ROOT_DIRECTORY variable, give the following value:
"/user/student/clickstream/"



You should have the following table:

9. Click Finish to save your context variables.


Next, you will use context variables and metadata in a job to load your Omniture log files into HDFS.

Load data files into HDFS

Overview
Now, you will start to build jobs to analyze your clickstream dataset.
Clickstream log files are unstructured and can be found in C:\StudentFiles\ClickStream\Omniture.tsv
You can edit the file to view its content.
The raw data file contains information such as URL, timestamp, IP address, geocoded IP address, and user ID (SWID).
The Omniture log dataset contains about 4 million rows of data, which represents five days of clickstream data. Organizations often process weeks, months, or even years of data.
As an example of what is inside the Omniture.tsv file, here is the first line of the log:

Using HiveQL, you could process and access the data, but instead of hand coding, you will create a Talend job to import all the data files into HDFS, and then create another job to perform the processing.

Delete existing files


First, you need a new job to load the data files into HDFS.
1. In the Repository, under Job Designs/Standard Jobs, create a ClickStream folder.
2. Right-click ClickStream, and then, click Create Standard Job and name it "LoadWebLogsHDFS" (without quotes).
3. Click Finish to create your new job.
4. In the Designer view, add a tHDFSDelete component. It will clean up the HDFS folder where you will store your data files.
5. Double-click to open the Component view.
6. To configure your component, you will use the Metadata Hadoop cluster connection you defined previously.
In the Property Type list, select Repository.
7. Click the ellipsis (...) button to open the Repository content browser and then browse to find HDFS_connection.
8. Click OK.
9. In the File or Directory Path box, you will use the context variable you defined previously, which gives the path to your root HDFS directory.
In the File or Directory path box, enter:
context.HDFS_ROOT_DIRECTORY+"input"
Note that you can use the Ctrl-Space shortcut to get the context variable names.
Your configuration should look like the following:



Load Data into HDFS
You will have 4 files to load into HDFS: Omniture.tsv, regusers.tsv, urlmap.tsv and states.tsv. These files can be found in C:/StudentFiles/ClickStream.
In Omniture.tsv, you will find the Clickstream log files, as mentioned before.
In regusers.tsv, you will find information on users: ID, birth date and gender.
In urlmap.tsv, you will find URLs and corresponding categories.
In states.tsv, you will find information on states: Postal code, State name, Capital and most populous city.
1. Add a tHDFSPut component in the Designer view, at the right side of tHDFSDelete_1, and then connect it with an OnSubjobOK trigger.
2. In the Component view, select Repository in the Property Type list. As done for tHDFSDelete_1, browse to find your HDFS_connection metadata.
3. In the Local Directory box, enter:
context.LOCAL_ROOT_DIRECTORY
4. In the HDFS Directory box, enter:
context.HDFS_ROOT_DIRECTORY+"input"
5. Select always in the Overwrite File list.

6. Below the Files table, click the green plus sign to add 4 files in the table. Configure it as shown below:

Run job
Your job is really simple and should be as follows:

Before running your job, you must set your context; otherwise, you will get an error message because context.HDFS_ROOT_DIRECTORY and context.LOCAL_ROOT_DIRECTORY are unknown.
1. Click the Contexts tab, next to Run and Component view.
2. Below the Variables table, click the Select Context variable button.



3. In the list, select ClickStream_context and then click OK.
Your variables now appear in the Variables table:

Run your job. You should not get any error message. Otherwise, check your Hadoop connection and your services in Hue.
At the end of the job's execution, you can check your results using the Hue File Browser. You should find your files in the directory "/user/student/clickstream/input".

Retrieve schema
In your next job, you will need to read those files from HDFS and, in order to read them, you will need the schema of each file. regusers.tsv, urlmap.tsv and states.tsv have simple schemas, so they can be created by hand. But Omniture.tsv has more than 170 columns, so it can be difficult to create its schema by hand.
Talend wizards can help you retrieve a schema from a file. You will use one to retrieve the schema of each file.
1. In the Repository, find HDFS_connection metadata.
2. Right-click HDFS_connection and then click Retrieve schema. This operation will work only if you have a working connection to your cluster.
3. In the opened browser, navigate to /user/student/clickstream/input and then select the 4 .tsv files.
4. In the creation status column, you will see the status change from Pending to Success. Then, you can check the file sizes and the number of columns found. You should see 173 columns for Omniture.tsv, 3 columns for regusers.tsv, 4 columns for states.tsv and 2 columns for urlmap.tsv.
5. Once you have a success status for all files, click Next.
6. You will start with the easiest files to understand the process.
In the Schema table, click regusers.

If the number of columns doesn't correspond to what you expected, you can click Guess Schema and confirm changes by
clicking OK in the popup warning message.
7. For regusers.tsv file, the wizard found 3 columns and named them Column0, Column1 and Column2.
Click in the Schema table to edit those names and change them to SWID, BIRTH_DT and GENDER_CD, as follows:

8. Following the same process, rename the states columns as: Postal, State, Capital and MostPopulousCity.
Rename the urlmap columns as: url and category.
9. The wizard guesses the schema but does not give names to your columns. If you look at the Omniture schema, you will see that there are 178 columns of various types and lengths. To simplify the processing, it is necessary to set all those columns to the String type. As this can be quite long to do, the schema for the Omniture file has already been created for you. You can find it in Metadata/Generic schemas/Omniture_schema.
10. In the Schema list, click Omniture and then click Remove Schema.
11. Click OK to confirm deletion.
12. Click Finish to create your schema. You should see them under HDFS_connection in Repository:

Next, you will use these schemas to read the files from HDFS and process them to enrich the logs with user information and try to extract interesting facts.



Enrich logs

Overview
Now, the goal is to add value to your data set. For example, compute some statistics and try to figure out what categories of articles women and men are interested in.
You will build a simple job to combine Omniture weblogs with products and user information.
At the end of this section, your job will look like the following:

First, you need to create a new job to read Omniture logs and products data into HDFS.

Read files from HDFS


1. In the Repository, create a new Standard Job and name it join_Omniture_logs.
2. In the Designer view, add 2 tHDFSInput components.
3. Double-click tHDFSInput_1 to open the Component view.
4. In the View tab, change the component's name to OmnitureLogs.
5. In the Basic Settings tab, in the Property Type list, select Repository, and browse to find HDFS_connection
metadata.
6. In the Schema list, select Repository, and then browse to find Omniture_schema, under Metadata/GenericSchemas.
7. In the Folder/File box, enter:
context.HDFS_ROOT_DIRECTORY+"input/Omniture.tsv"
Note that you can use the Ctrl-Space shortcut to get the context variable name.

Your configuration should be as follows:

8. Double-click the tHDFSInput_2 label and then name it Products.


9. Configure your component as follows:



Mapping weblogs and products
With a tMap component, you will combine data from Omniture and urlmap. The goal here is to join the 2 datasets using URL. In fact,
in urlmap.tsv, you can find the equivalence between URLs and categories.
So, you will extract logdate, IP, url, user ID (SWID), city, country, state and category data. You will also store urls rejected by the join
operation.
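Note: conceptually, this tMap inner join behaves like a key lookup. Each log row's URL (Column12) is looked up in the urlmap data; matching rows go to the main output with the category attached, and rows without a match go to the rejects output. Below is a rough plain-Java sketch of that logic with dummy values; it is only an illustration, not the code Talend generates.

    import java.util.*;

    // Rough illustration of an inner join with a rejects output (dummy data).
    public class UrlJoinSketch {
        public static void main(String[] args) {
            // Lookup data: url -> category (what urlmap.tsv provides).
            Map<String, String> urlToCategory = new HashMap<>();
            urlToCategory.put("http://example.com/products/shirts", "clothing");

            // A couple of log URLs (Column12 of the Omniture logs).
            List<String> logUrls = Arrays.asList(
                    "http://example.com/products/shirts",
                    "http://example.com/unknown/page");

            List<String> logs2 = new ArrayList<>();   // joined rows (inner join output)
            List<String> rejects = new ArrayList<>(); // URLs with no match in urlmap

            for (String url : logUrls) {
                String category = urlToCategory.get(url);
                if (category != null) {
                    logs2.add(url + "\t" + category); // enriched row
                } else {
                    rejects.add(url);                 // captured by the rejects output
                }
            }

            System.out.println("joined:  " + logs2);
            System.out.println("rejects: " + rejects);
        }
    }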
1. Add a tMap component to your design and connect it as shown below:

Remember to rename the different rows and components in your design. It will be easier to configure your mapping if your rows have names that are easy to understand.
2. Double-click the Lookup_products component to open the Map Editor.
3. In the left bottom part of your Map Editor, you will see the products table, with 2 columns, url and category.
Click in the Expression key box of url to edit it and then enter:
logs.Column12
This is equivalent to a drag-and-drop operation, but more convenient as the logs table is huge. Column 12 of logs corresponds to the URL data.
4. Click the tMap settings button.
In the Match model list, select All matches.
In the Join Model list, select Inner Join.

5. Click the green plus sign to create a new output and then name it logs2.

6. Configure logs2 as follows:

7. In the schema of logs2, set the logdate column type to Date.


8. Click the green plus sign to create another output and then name it rejects. This new output will capture all rejected
URLs.
Configure rejects as follows:

9. Click OK to save your mapping.

Mapping with users


1. Add a tHDFSInput component to read regusers from HDFS.
2. Set the Property Type to Repository and browse to find HDFS_connection.
3. Set Schema to Repository and browse to find the regusers schema in Metadata/Hadoop Cluster/TrainingCluster/HDFS/HDFS_connection.
4. In the Folder/File box, enter
context.HDFS_ROOT_DIRECTORY+"input/regusers.tsv"



Your configuration should be as follows:

You have read the regusers file, so you can do the mapping with a new tMap component. From the regusers file, you need to get the gender for each identified user (SWID as the key).
1. Add a tMap component to the right side of the Lookup_products component, then connect it and rename the rows and components as follows:

2. Double-click Lookup Users to open the Map editor.


3. Click swid in logs2 and then drag it to the SWID column of the users table.

4. Configure your join operation as follows:

5. Add a new output and then name it logsout.


6. Select all columns in logs2 and then drag them to logsout.
7. Select the GENDER_CD column in users, drag it to logsout, and then change the column name to gender in the logsout schema.
8. Click the ellipsis (...) button to edit the gender expression in the logsout table and then enter:
users.GENDER_CD != null ? users.GENDER_CD : "U"
This operation is needed to change null values to U, which means Unknown.
Your logsout table should look like the following:

9. Click OK to save your mapping.

Store results in HDFS


You have enriched your logs with gender and categories, so it could be interesting to store this in HDFS to make it available for further investigation. You will also store the rejected URLs from the first join operation.
As usual, you will use a tHDFSOutput component to save your data into HDFS.
1. Add 2 tHDFSOutput components to your design.
Place the first one below the Lookup_products component, to store rejected URLs and then connect it with the rejects
row.
2. Place the second one at the right side of the Lookup Users component, to store enriched logs and then connect it with the
logsout output.



3. Rename components and rows as follows:

4. Double-click Rejects to open the Component view.


5. As done previously, you will use HDFS_connection to write the rejected URLs into HDFS. Your data will be saved in:
context.HDFS_ROOT_DIRECTORY+"input/rejects/url_rejects"
Your configuration should be as follows:

6. Double-click Results to open the Component view.


7. You will still use HDFS_connection to write your join results. Your data will be saved in:
context.HDFS_ROOT_DIRECTORY+"results/join_results"

8. Your configuration should be as follows:

9. Click the ellipsis (...) button to edit the schema, and then click the floppy disk icon to save it.
10. Name the schema ResultsSchema. Once created, it will appear in the Repository under Metadata/Generic Schemas.

Run Job and check results


Now your job to enrich the logs should be ready to run. As a first verification, you can check the data you will read from HDFS using the Data Viewer.
1. Right-click the OmnitureLogs component and then click Data Viewer at the bottom of the contextual menu.
This will open the Data Preview window:



2. Run your job.
3. In Hue, open the File browser and then navigate to
/user/student/clickstream/input/rejects/url_rejects.
The file is empty because no URLs were rejected.
4. Navigate to /user/student/clickstream/results and then click the join_results file to see the result of join operations:

Next, you will analyze your data in the same job and convert this standard job into a Big Data Batch Job.

Compute statistics

Overview
In the previous section, you enriched logs with the gender of users and the categories of products. Next you will process the data to
extract statistical information, such as how many men navigated to clothing web pages, or how many women navigated to computer
web pages.
At the end of this section, your job will look like the following:

First, you will need to read the enriched logs from HDFS.

Read enriched logs


1. In the Designer view, add a tHDFSInput component below OmnitureLogs and then connect it with an OnSubjobOk trigger.
2. Double-click to open the Component view.
3. In the Property Type list, select Repository, and browse to find HDFS_connection.
4. In the Schema list, select Repository and then use the ResultsSchema generic schema metadata.
5. Your schema should be as follows:



6. Click OK to save the schema.
7. In the File Name box, enter:
context.HDFS_ROOT_DIRECTORY+"results/join_results"

Filter data
Before computing the statistics, you will filter the data to keep only users for which the gender is known. That means eliminating undetermined or empty values. You will perform this with a tFilterRow component.
1. Add a tFilterRow component at the right side of tHDFSInput_4 and then connect it with the Main row.
2. Double-click tFilterRow to open the Component view.
3. Click Edit Schema and then select all columns in the Input table, and push them into the Output table.
4. Click OK to save your schema.
5. Under the Conditions table, click the green plus sign twice to add two conditions and then configure as follows:

Compute statistics
Once the data is cleaned up, you can process it using a tAggregateRow component. This will allow you to count the number of men and women per product category (a plain-Java sketch of this computation follows the steps below).
1. Add a tAggregateRow component at the right side of the tFilterRow component and then connect it with a Filter row.
2. Double-click to open the Component view.
3. Click Edit Schema and then configure it as follows:

There are three Output columns: category, gender and Nb.


4. Click OK to save the schema.
5. Below the Group by table, click the green plus sign twice to add two lines and then configure it as follows:

6. Below the Operations table, click the green plus sign to add a line and then configure it as follows:
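Note: to see what the filter-then-aggregate pair computes, here is a minimal plain-Java sketch of the same logic with dummy rows: keep only the rows whose gender is known, then count the rows per (category, gender) pair. It only illustrates the computation; it is not Talend's generated code.

    import java.util.*;

    // Plain-Java illustration of tFilterRow + tAggregateRow:
    // keep known genders, then count rows per (category, gender).
    public class CategoryGenderCount {
        public static void main(String[] args) {
            // Dummy enriched-log rows: {category, gender}
            String[][] rows = {
                    {"clothing", "F"}, {"clothing", "F"}, {"computers", "M"},
                    {"clothing", "U"}, {"computers", ""}, {"computers", "M"}
            };

            Map<String, Integer> counts = new TreeMap<>();
            for (String[] row : rows) {
                String category = row[0];
                String gender = row[1];

                // tFilterRow: keep only rows where the gender is known.
                if (gender.isEmpty() || gender.equals("U")) {
                    continue;
                }

                // tAggregateRow: group by (category, gender) and count.
                counts.merge(category + " / " + gender, 1, Integer::sum);
            }

            counts.forEach((key, nb) -> System.out.println(key + " -> " + nb));
            // clothing / F -> 2
            // computers / M -> 2
        }
    }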

Run job and check results


To check your results, you will use a tLogRow component.
1. Add a tLogRow component at the right side of the tAggregateRow component and then, connect it with a Main row.
2. Double-click to open the Component view.
3. In Mode, select Table.
Your job should be complete and look like the following:

4. Run your job.


You should see this result in your console:



Next, you will convert this standard job into a Big Data Batch MapReduce job.

Convert to Big Data Batch Job

Overview
Now your job runs and computes several statistics. If you had to analyze months of logs, this job would be much more efficient if it could benefit from the MapReduce framework of Hadoop.
You will transform your standard job into a Big Data Batch Job running on YARN, and then run it again to verify that the converted job still works.
At the end of this section, your job will look like the following:

Convert to Big Data Batch Job


1. Save your job and close it.
2. In the Repository, under Standard jobs/ClickStream, right-click join_Omniture_logs and then, click Edit properties in
the contextual menu.
3. In the Job Type list, select Big Data Batch.



4. In the Framework list, select MapReduce.
5. Click Finish:

To run a Big Data Batch Job, the cluster information must be given at the Job level, not only at the component level. This pop-up window allows you to choose which component to import the cluster configuration from.
6. Click OmnitureLogs and click OK.
This will convert your Job to a Big Data Batch Job. It will appear in the Repository under Big Data Batch/ClickStream:

7. Open join_Omniture_logs job.
When you convert a standard job into a MapReduce job, you may have to reconfigure some components that are not exactly the same in a Standard job and in a MapReduce job, and some components may not be available in a MapReduce job (they are marked with a red cross). You may also have to configure the connection to your Hadoop cluster.
8. In the Run tab, click Hadoop Configuration.
9. Some information is missing in the cluster configuration.
In the Property Type list, select Repository and then browse to find TrainingCluster.
10. Double-click the Results component and then configure it as follows:

11. Double-click the Rejects component and then configure it as follows:

Note: You can also use the context variable HDFS_ROOT_DIRECTORY created earlier.

Run job and check results


Now run your job. You can follow the execution of your job in the Console, in the Designer view (progress bars) or in Hue.
At the end of the execution, you should see the statistics you computed in the Console:



There are many ways to achieve the same result, but some solutions may be more efficient than others. So next, you will see how your MapReduce job is built.

Understanding MapReduce jobs

Overview
When you build a MapReduce job, you can see light blue rectangles appearing on your design. They correspond to the MapReduce jobs that will be sent to your Hadoop cluster. Depending on the way you connect your components, and the version of Talend you use, the number of MapReduce jobs may differ.
You will add a new subjob that achieves the same result as before, with the same components but connected differently.
At the end of this section, your job will look like the following:

Read enriched logs


1. In the Designer view, add a tHDFSInput component below the last tHDFSInput and then connect it with an OnSubjobOk trigger.
2. Double-click to open the Component view.
3. In the Property Type list, select Repository, and browse to find HDFS_connection.
4. In the Schema list, select Repository, and then use the ResultsSchema generic schema metadata.



5. Your schema should be as follows:

6. Click OK to save the schema.


7. In the File Name box, enter:
context.HDFS_ROOT_DIRECTORY+"results/join_results"

Aggregate data
Next, you will aggregate the data with a tAggregateRow component.
1. Add a tAggregateRow component at the right side of the tHDFSInput component and then connect it with a Main row.
2. Double-click to open the Component view.
3. Click Edit Schema and then configure it as follows:

There are three Output columns: category, gender and Nb.


4. Click OK to save the schema.
5. Below the Group by table, click the green plus sign twice to add two lines and then configure it as follows:

6. Below the Operations table, click the green plus sign to add a line and then configure it as follows:

Filter data
Now, your subjob computes the number of men "M", women "F", unknown "U" and empty gender values for each category. As you are only interested in statistics for men and women, you can filter the result to keep only what is needed. You will perform this with a tFilterRow component.
1. Add a tFilterRow component at the right side of tAggregateRow and then connect it with a Main row.
2. Double-click to open the Component view.
3. Click Sync columns since you will have the same columns before and after the filter.
4. Under the Conditions table, click the green plus sign twice to add two conditions and then configure as follows:

Run job and check results


To check your results, you will use a tLogRow component.
1. Add a tLogRow component at the right side of the tFilterRow component and then, connect it with a Filter row.
2. Double-click to open the Component view.



3. In Mode, select Table.
Your job should be complete and look like the following:

4. Run your job.
You should see this result in your console:

Understanding MapReduce Jobs


This is exactly the same result as before. But if you look at your design, you simply inverted the aggregation and filtering steps. Depending on the version of your Talend Studio, this inversion can have an impact on your MapReduce tasks. For releases before Talend Platform for Big Data 5.5, the first way to compute the statistics required one MapReduce task, while the second required two MapReduce tasks, resulting in suboptimal generated MapReduce code that can lead to poor performance. Below is the same job in the Platform for Big Data 5.4.1:



It is a good practice to design optimization into your MapReduce job from the beginning if you are working on an earlier version of the Studio. Since release 5.5, the Studio has been improved to generate better MapReduce code and to significantly improve the performance of MapReduce jobs.
Each time it is necessary to write data to disk, a new MapReduce task is created (green rectangles). In the current subjob, you have different operations on your data, mainly filtering and aggregation. Simple operations such as filtering rows and columns only require map tasks. On the contrary, to aggregate or sort data, it is necessary to collect it first, which produces intermediate files (that is, writes to disk) and needs a reduce task. To determine the span of each green rectangle, there is a simple rule: you can have several consecutive map tasks, then a single reduce task, followed by additional map tasks. In other words, only one reduce task is allowed per MapReduce task.
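Note: to make the map/reduce split concrete, here is a minimal hand-written Hadoop MapReduce sketch of the same kind of statistic (a count per category and gender). The mapper only filters and emits a composite key, so it stays a map task; the reducer performs the aggregation, which is what forces the intermediate write. This is an illustration of the concept only, not the code the Studio generates, and the field positions in the input line are assumptions.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: filtering and key extraction only -- a map task is enough for this part.
    public class CategoryGenderMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed layout of join_results: tab-separated, category and gender as the last two fields.
            String[] fields = value.toString().split("\t");
            String category = fields[fields.length - 2];
            String gender = fields[fields.length - 1];

            // Equivalent of tFilterRow: drop unknown or empty genders.
            if (!gender.isEmpty() && !"U".equals(gender)) {
                context.write(new Text(category + "|" + gender), ONE);
            }
        }
    }

    // Reducer: the aggregation step -- this is what requires the reduce task
    // (and the intermediate files) in the generated job.
    class CategoryGenderReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }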
To see the generated code for a particular component:
1. In the Designer view, click the tAggregateRow_1 component.
2. Click the Code tab, next to Designer tab.
You should reach the corresponding piece of code:

In this part of the generated code, you can see how map and reduce tasks are added to the general processing of your job.
3. Press Ctrl-F to find tAggregateRow_1Mapper if you want to have details of the map task:

The naming convention for the mapper and reducer generated for a component is:
ComponentNameMapper
ComponentNameReducer
Now, it is time for you to reinforce your knowledge with an exercise.



Challenges

Overview
Complete these challenges to further explore MapReduce jobs. See Solutions for possible solutions to these exercises.

Challenge
As an exercise, you will now enrich your design to find the following information:
1. How many men and women from New York state (NY) navigated the different web pages?
2. What did they buy?
3. Sort the categories in descending order to find the most and least interesting categories for men and women, and how many men and women clicked the corresponding web pages.

Once you are done with the Challenge, it is time to Wrap up.

Solutions

Overview
These are possible solutions to the challenge. Note that your solutions may differ and still be valid.

How many men/women from NY state?


1. How many men and women from New York state (NY) navigated the different web pages?

- First, you should read the join results from HDFS with a tHDFSInput component.
Add a tHDFSInput component below tHDFSInput_4 and then connect it with an "OnSubjobOK" trigger.
- Double-click to open the Component view.
- Select Repository in the Property Type list and then browse to find the HDFS_connection metadata.
- In the Folder/File box, enter:
context.HDFS_ROOT_DIRECTORY+"results/join_results"

- Next, you should filter your data to keep only the rows with men or women as the gender and NY as the state.
- Add a tFilterRow component at the right side of tHDFSInput_4 component and then connect it with a Main row.
- Double-click tFilterRow to open the Component view.
- Set the three conditions as follows:

- Then you can count how many men and women from NY state clicked on the web pages, with a tAggregateRow component.
Add a tAggregateRow component at the right side of tFilterRow and then connect it with the Filter row.
- Double-click to open the Component view.
- Click Edit Schema and then add 2 output columns: gender and Nb.
Your data will be grouped using the gender column, and the Nb column will store the count of IP addresses. So the type of the gender column is String and the type of Nb is Integer:

Your data will be aggregated as follows:



- To read the results of your aggregation, add a tLogRow component at the right side of the tAggregateRow component and then connect it with a Main row.
- Double-click to open the Component view and then select Table in Mode.

- Now you can run your job, and in the console you should have the following result:

What did they buy?


1. What did they buy?

- In order to keep the previous computation, you will replicate the output of the tFilterRow component.
- Delete the row between tFilterRow and tAggregateRow.
- Insert a tReplicate component.
- Connect tFilterRow to tReplicate with the Filter row, and then connect tReplicate to tAggregateRow with a Main row.

- To know what categories the customers were interested in, you will use another tAggregateRow component.
- Add a tAggregateRow component below the previous tAggregateRow component, and then connect it to tReplicate with
a Main row.
- Double-click to open the Component view.
- Click Edit schema, and then add three output columns: category (String), gender (String) and Nb (Integer).
- Your data will be grouped by category and gender.
- Nb will store the count of IP addresses:

Most/least interesting categories?
1. Sort the categories in descending order to find the most and least interesting categories for men and women, and how many men and women clicked the corresponding web pages.

- To sort the result of the aggregation step, add a tSortRow component at the right side of the tAggregateRow component
and then connect it using a Main row.
- Double-click to open the Component view.
- Below the Criteria table, click the green plus sign to add a new line.
- Set the Schema column to Nb, select num in the "sort num or alpha?" list and then select desc in the "Order asc or desc?"
list.

- To visualize your results, add a tLogRow component at the right side of tSortRow and then connect it with a Main row.
- In the Component view, select Table in Mode.

- Run your job, and you should see the following result in the console:



Wrap-Up

Recap
In this lab use case, you created a job with Hadoop cluster connection metadata and context variables. You also loaded multiple files into HDFS, used Talend wizards to retrieve their schemas, and checked data with the Data Viewer. Then you converted this standard job into a MapReduce job, and you saw the impact that the order in which your components are connected has on the MapReduce tasks sent to your cluster.

Further Reading
For more information about topics covered in this tutorial, see the Talend Platform for Big Data User Guide.

LESSON 2
Sentiment Analysis Use Case
This chapter discusses the following.

Sentiment analysis
Load Lookup Data Into HDFS
Load tweets into HDFS
Process tweets with MapReduce
Challenge
Solutions
Wrap-Up
Sentiment analysis

Overview
This lab use case describes how to perform sentiment analysis on Twitter tweets using Talend Big Data Batch MapReduce jobs.
Sentiment analysis is the process of analyzing texts (that is, people's opinions) from social media websites and mapping the polarity of these opinions. Sentiment analysis enables organizations to be proactive about the opinions being expressed on social media websites.
In this lab, you will collect all the tweets related to a #hashtag value for a brief period and then provide an analysis of the hashtag sentiment and geolocations.

Objectives
After completing this lab use case, you will be able to:
Use Twitter API with Talend components
Send data to Hadoop HDFS using Talend components
Develop and run a Big Data Batch job, using the MapReduce framework

Before You Begin


Be sure that you are working in an environment that contains the following:
A properly installed copy of Talend Platform for Big Data with a valid license
A properly configured Hadoop cluster
The supporting files for this tutorial
Everything has already been set up for you in your virtual environment.
The first step is to load the data dictionary into HDFS.



Load Lookup Data Into HDFS

Overview
This job will load the dictionary and time zone lookup data needed for the analysis into HDFS. It needs to be executed only once, if it has never been executed before.
Warning: if it is executed after an analysis, it will delete all the analysis data and restart with an empty folder.
At the end of this section, your job should look like the following:

First, you will create a new job and clean-up HDFS folders where you will save lookup and analysis data.

Clean-up HDFS
1. In the Repository view, under Job Designs, create a new folder and name it Sentiment.
2. Right-click Sentiment and then click Create Standard Job in the contextual menu.
3. In the Name box, enter LoadLookup and then click OK to create your job.
4. Add 2 tHDFSDelete components in the Designer view and then connect them with an "OnSubjobOk" trigger, as follows:

5. Double-click tHDFSDelete_1 to open the Component view.
6. Use the HDFS_connection metadata.
7. In the File or directory Path box, enter "/user/student/sentiment/lookup":

8. Repeat the same process for tHDFSDelete_2, except for the folder to be deleted which will be different.
In the File or directory Path box, enter "/user/student/sentiment/twitter_analysis":

Next, you will read the dictionary file and load it into HDFS.



Load Dictionary data into HDFS
The dictionary data is stored in a .tsv file. The dictionary is a list of words and how they should be interpreted: it provides the strength or weakness of each word and whether the word is negative or positive. This is useful to quantify the impact of a message.
To see the file content, you can navigate to C:/StudentFiles/Sentiment/dictionary/dictionary.tsv, and then open it in a text editor.

In your Talend job, you will read the content of the file and then load it into HDFS. You will need 2 components to achieve this task.
tFileInputDelimited will read the file and tHDFSOutput will load data into HDFS.
1. Add a tFileInputDelimited component below tHDFSDelete_2 and then connect it with an OnSubjobOk trigger.
2. Double-click tFileInputDelimited_1 to open the Component view.
3. In the File name/Stream box, enter
"C:/StudentFiles/Sentiment/dictionary/dictionary.tsv". Or you can click the ellipsis (...) button to open a browser and find your
file.
4. Enter "\n" (including quotes) in the Row Separator box, and then enter "\t" (including quotes) in the Field Separator box.
This means that each row ends with a newline character and that tabs are used as the column separator:

To make your job easier to understand, you can change the label of the component to Dictionary instead of tFileInputDelimited_1.
5. Click Edit Schema and then click the green plus sign to add Columns.
Configure the schema as follows:

6. Click OK to save the schema.


7. Click Advanced Settings in the Component view.

8. In the Encoding list, select CUSTOM and then enter "US-ASCII" (including quotes) in the box.
This is the encoding type of the file.
9. Add a tHDFSOutput component at the right side of the Dictionary component and then connect it with a Main row.
10. Double-click tHDFSOutput_1 to open the Component view.
11. Use the HDFS_connection metadata.
12. In the File Name box, enter
"/user/student/sentiment/lookup/dictionary".
13. In the Action list, select Overwrite.
This allows you to run the job as many times as needed without getting an error message if the file already exists.
14. Select the Include Header option.

Your configuration should look like the following:

15. Click the Edit Schema button to make sure you have the following configuration:



If your job runs correctly, you should be able to see the dictionary file in HDFS using Hue. You will check this later in this section.
Next, you will load time zone data into HDFS.

Load Time zone data into HDFS


You will repeat the same process to load time zone data into HDFS.
1. Add a tFileInputDelimited component below the Dictionary component and then connect it with an OnSubjobOk trigger.
2. Add a tHDFSOutput component at the right side of the tFileInputDelimited component.
3. Rename components and connect them as shown below:

4. Double-click the Time_zone file component to open the Component view.


5. In the File name/Stream box, enter "C:/StudentFiles/Sentiment/time_zone_map/time_zone_map.tsv".
If you edit the file, you will see that its content is composed of 2 columns: time zone and corresponding country.
6. Enter "\n" in the Row Separator box and then, enter "\t" in the Field Separator box.
7. Enter "1" (without quotes) in the Header box.
8. Click Edit Schema and configure it as shown below:

9. Click OK to save the schema.

10. Click Advanced Settings in the Component view.
11. Select CUSTOM in the Encoding list and then enter "US-ASCII" in the box.
12. Double-click the Time_zone HDFS component to open the Component view.
13. Provide the same information about your Hadoop cluster as in the Dictionary HDFS component, using the HDFS_connection metadata.
14. In the Filename box, enter
"/user/student/sentiment/lookup/timezone".
15. Configure your component to overwrite existing files.
16. Select the Include Header option.
Your configuration should look like the following:

17. Click Edit schema and make sure it is configured as follows:

Now you can run your job and check the results in HDFS.



Run Job and check results
In the Run view, click the Run button. Your job should succeed without any errors.
Once your job executes, you can check your results using Hue.
1. Open Hue in your browser.
2. Click the File Browser icon.
3. Navigate to files in path "/user/student/sentiment/lookup".
You should find your files as shown below:

The next step is to import your tweets into HDFS.

Load tweets into HDFS

Overview
This job will read tweets from Twitter using the output of a call to the Twitter REST API and then load the tweets into HDFS.

The Twitter REST API requires proper setup of a developer account and an authentication key (OAuth Token). Then, you will be
able to send queries and collect Tweets related to a particular #hashtag value for a brief period.

The data collected from Twitter are in JSON format. Here is an extract of what will be received and processed, displayed in tree view
mode:

At the end of this section, your job should look like the following:

In this job, there are a few operations, such as extracting only the valuable information from each tweet, standardizing the TEXT (message) of the tweets, and applying a first layer of transformation.
The process also uses a dictionary of positive, negative or neutral words to determine the sentiment of the tweets.
The requested hashtag will be defined in a context variable, so feel free to try any hashtag.



Note: This job requires access to the internet to run, as it queries Twitter live.
If you want to know more about the Twitter API, navigate to: https://dev.twitter.com/docs/using-search.
If you want to know more about authentication, navigate to: https://dev.twitter.com/oauth/overview/authentication-by-api-family

Query Twitter
First, you will use a tRESTClient component to query Twitter with a hashtag of interest.
1. Create a new Standard Job and name it ImportTweets.
2. Add a tRESTClient component in the Designer view.
3. In the URL box, enter
"https://api.twitter.com/1.1/search/tweets.json"
4. In the HTTP Method list, click GET and then, in the Accept Type list, click JSON.
5. In Query parameters, you can set options and build your own query. The available options are described at https://dev.twitter.com/rest/reference/get/search/tweets (a plain-Java sketch of the resulting HTTP request appears at the end of this section).
You will use 3 options: q, count and result_type.
Below the Query parameters table, click the green plus sign to add three parameters.
6. For the first parameter, set the name to "q" (including quotes) and then set the value to context.HASHTAG_QUERY (without quotes).
This first parameter will define the hashtag of interest. And this particular hashtag will be stored as a context variable.
You will come back to how to define this later in the lab.
7. The second parameter's name is "count" (including quotes) and its value is "180" (including quotes).
This is the maximum number of tweets you will get from this query.
8. The third parameter's name is "result_type" (including quotes) and its value is "mixed" (including quotes). That means the result will be a mix of popular and recent tweets.
Your Query parameters table should look like the following:

9. Select Use Authentication and then select OAuth2 Bearer in the list.
10. Copy and paste the following key, as a single line, in the Bearer Token box (including the quotes):
"AAAAAAAAAAAAAAAAAAAAAOnGUwAAAAAAASlaRBWm7y%2BhNmnYmvUIBuM91Ws%3DZaDRhPsQaBHDjjcthoXYvW9ju3hWfnbsxOFlnd4qhSwo8uSZ3h"
Note: You can use the LabCodeToCopy file to copy and paste the expression. Also, make sure that there are no spaces in the expression due to the copy/paste operation.
Now, you will extract useful information using a tXMLMap component.
Note: You can create your own OAuth2 Bearer token by following the steps described at https://dev.twitter.com/docs/auth/application-only-auth
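Note: if you want to see what the tRESTClient component does under the hood, the request it sends is an ordinary HTTP GET with the three query parameters and an Authorization: Bearer header. Below is a minimal plain-Java sketch of such a request; the hashtag and token values are placeholders, and this is an illustration, not the component's generated code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    // Illustration of the HTTP GET request sent to the Twitter search API.
    public class TwitterSearchSketch {
        public static void main(String[] args) throws Exception {
            String hashtag = "#talend";          // placeholder for context.HASHTAG_QUERY
            String bearerToken = "YOUR_TOKEN";   // placeholder for the OAuth2 Bearer token

            String query = "q=" + URLEncoder.encode(hashtag, "UTF-8")
                    + "&count=180"
                    + "&result_type=mixed";
            URL url = new URL("https://api.twitter.com/1.1/search/tweets.json?" + query);

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("Accept", "application/json");
            conn.setRequestProperty("Authorization", "Bearer " + bearerToken);

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // raw JSON response (the tweets)
                }
            }
        }
    }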

Extract useful information


1. Add a tXMLMap component at the right side of tRESTClient_1 and then connect it with a Response row.
2. Click row1 label to edit it and then change the label to rest.
3. Double-click tXMLMap to open the Map editor.
4. The schema of the tweets is quite complicated and has already been saved for you in a .XML file. You will use this file instead of creating the schema by hand.
In the rest schema, right-click body and then click Import from file in the context menu.
5. Browse to find "C:/StudentFiles/Sentiment/twitter/twitter_schema.xml" and then, click Open to import the schema file.
6. Right-click on the user node, click As loop element in the contextual menu and then click OK to save.
The beginning of the rest table should look like the following:

7. Add a new output to the tXMLMap component by clicking the green plus sign in the upper right corner of the Map Editor.
Leave the name as out1 by default.
8. In rest, click id (in body/root/statuses) and then drag it to out1.
9. Click screen_name (in body/root/statuses/user) and then drag it to out1.
10. Click text (in body/root/statuses) and drag it to out1.
11. Click the ellipsis (...) button which opens the Expression editor, to edit your new expression.
12. Copy and paste the following expression in the Expression box, to process particular characters in the text and replace them with a white space (a plain-Java restatement of this and the following expressions appears after these steps):
[rest.body:/root/statuses/text].replaceAll(String.valueOf((char)0x0A), " ").replaceAll(String.valueOf((char)0x0D), " ")
Note: You can use the LabCodeToCopy file to copy and paste the expression. Also, make sure that there are no spaces in the expression due to the copy/paste operation.

Click Ok to save the expression.



13. Click lang (in root/statuses/user) and then drag it to out1.
14. Click time_zone (in root/statuses/user) and then drag it to out1.
15. Click the ellipsis (...) button which opens the Expression editor, to edit your new expression.
16. Copy and paste the following expression in the Expression box:
("").equals([rest.body:/root/statuses/user/time_zone]) ?
Tweets_Generator.getCountry() :
[rest.body:/root/statuses/user/time_zone]
This expression tests whether the time_zone information is empty. If it is empty, the time zone information is replaced using the getCountry function; otherwise, the value remains the same.
17. Click followers_count (in root/statuses/user) and then drag it to out1.
18. Click the ellipsis (...) button which opens the Expression editor, to edit your new expression.
19. Copy and paste the following expression in the Expression box:
[rest.body:/root/statuses/user/followers_count] +
[rest.body:/root/statuses/user/friends_count] +
[rest.body:/root/statuses/user/listed_count] +
[rest.body:/root/statuses/user/favourites_count] +
[rest.body:/root/statuses/user/statuses_count] +
[rest.body:/root/statuses/retweet_count] +
[rest.body:/root/statuses/favorite_count]

Note: You can use the LabCodeToCopy file to copy and paste the expression. Also, make sure that there are no spaces in the expression due to the copy/paste operation.
20. Click Ok to save the expression.
This expression computes the total influence of a tweet by summing up different counts, such as the followers count, friends count, and so on.
But for this expression to succeed, you must make sure that you are summing up integers and not strings.
21. In the Map Editor, click Tree Schema editor (at the right side of Schema Editor, lower left corner).
22. Expand the XPath column and change the type to Integer for rest.body:/root/statuses/user/followers_count, friends_count, listed_count, favourites_count and statuses_count, and for rest.body:/root/statuses/retweet_count and favorite_count.
23. Click the Schema editor tab and then click the followers_count column in the out1 schema (in the lower right corner of the Map editor) to change its name to total_influence.
24. Select Integer in the Type list.
25. Click created_at (under root/statuses) and then drag it to out1.
Your out1 schema should look like this:

And the out1 table should look like this:

26. Click OK to save your mapping.
From the original content of the tweets, you extracted a few useful pieces of information, such as the id, text, lang, time zone and total influence.
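Note: the three expressions you just entered are ordinary Java once the XPath references are resolved. The sketch below restates them with plain variables so their intent is easier to read; the variable names and sample values are illustrative, and getCountry() here is only a stand-in for the lab's Tweets_Generator.getCountry() routine.

    // Plain-Java restatement of the three tXMLMap expressions (illustrative names and values).
    public class TweetExpressionSketch {

        // Stand-in for the lab's Tweets_Generator.getCountry() routine.
        static String getCountry() {
            return "France";
        }

        public static void main(String[] args) {
            String text = "Great product!\r\nLoving it";
            String timeZone = "";          // empty in this example
            int followersCount = 120, friendsCount = 80, listedCount = 2,
                favouritesCount = 15, statusesCount = 300, retweetCount = 4, favoriteCount = 9;

            // 1. Replace line feeds (0x0A) and carriage returns (0x0D) with spaces.
            String cleanedText = text
                    .replaceAll(String.valueOf((char) 0x0A), " ")
                    .replaceAll(String.valueOf((char) 0x0D), " ");

            // 2. Fall back to getCountry() when the time zone is empty.
            String resolvedTimeZone = ("").equals(timeZone) ? getCountry() : timeZone;

            // 3. Total influence: the sum of all the counts (they must be integers, not strings).
            int totalInfluence = followersCount + friendsCount + listedCount
                    + favouritesCount + statusesCount + retweetCount + favoriteCount;

            System.out.println(cleanedText);       // Great product!  Loving it
            System.out.println(resolvedTimeZone);  // France
            System.out.println(totalInfluence);    // 530
        }
    }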
Now you will start to process the format of tweets before loading them into HDFS.

Standardizing text
The next step is to standardize the text in tweets. You will need a tStandardizeRow component.
1. Add a tStandardizeRow component at the right side of tXMLMap_1, and connect it with the out1 output of tXMLMap_1.
2. Double-click to open the Component view.
3. In the Column to parse list, select text.
4. In the Conversion rules table, add a new rule by clicking the green plus sign below the table.
5. Set Name to "WORDS" and then set Type to Format.
6. Set Value to "(WORD)|(ALPHANUM)"
This means that only words or alphanumeric characters are allowed in the text; otherwise, the row will be rejected (a rough regex-based illustration follows these steps).
The component does not modify the input data, but it adds a new column with the standardized data.
7. This component generates Java code corresponding to the rules in the Conversion rules table. When you modify, add or
delete a rule, this code must be regenerated using Generate parser code in Routines.
Click Generate parser code in Routines (the gear button).
Your configuration should look like the following:
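Note: the "(WORD)|(ALPHANUM)" rule is written in tStandardizeRow's own rule syntax, but its effect is roughly to keep only word and alphanumeric tokens. The plain-Java sketch below approximates that behavior with a regular expression; it is an approximation only, not the parser code the component generates.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Rough regex approximation of the "(WORD)|(ALPHANUM)" standardization rule:
    // keep only word/alphanumeric tokens from the tweet text.
    public class WordTokenSketch {
        public static void main(String[] args) {
            String text = "Loving the #newPhone!!! 100% recommended :-)";

            Pattern token = Pattern.compile("[A-Za-z0-9]+");
            Matcher matcher = token.matcher(text);

            StringBuilder normalized = new StringBuilder();
            while (matcher.find()) {
                if (normalized.length() > 0) {
                    normalized.append(' ');
                }
                normalized.append(matcher.group());
            }

            // Punctuation and symbols are dropped; only the tokens remain.
            System.out.println(normalized); // Loving the newPhone 100 recommended
        }
    }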

In order to be able to continue to process your text data as an XML structure, you must convert the text data type. This will be achieved with a tConvertType component.
1. Add a tConvertType component at the right side of tStandardizeRow_1 and then connect it with a Main row.
Your design should look like the following:



2. Double-click tConvertType_1 to open the Component view.
3. Click the Sync columns button and then click Edit Schema.
4. In the tConvertType_1 (Output) table, change the NORMALIZED_FIELD type to Document.
5. Click OK to save the schema.
6. Select Auto Cast and then select Set Empty values to Null before converting.
"Auto Cast" will perform an automatic java type conversion. And "Set Empty values to Null before converting" will set empty
values of String or Object to null for the input data.
The next step is to combine the information from the tweets with the information from the dictionary loaded into HDFS.

Combine tweets and dictionary


1. Browse to C:/StudentFiles/Sentiment/twitter/twitter_normalized_field.xml, and then open it in an editor. You will see the structure of NORMALIZED_FIELD. You will need this structure to combine the information from the tweets with the dictionary.

2. Add a tXMLMap component at the right side of tConvertType_1 and connect it with a Main row.

3. Add a tHDFSInput component below tXMLMap_2 and connect it as below:

4. Double-click tHDFSInput_1 to open the Component view.


5. As done earlier, configure the component to read the dictionary in HDFS, as shown below:

6. The dictionary schema can be found in the first tFileInputDelimited component of the LoadLookup Job. Copy and paste
the dictionary schema.
7. Double-click the second tXMLMap component to open the Map Editor.
8. The next step is to set up the NORMALIZED_FIELD structure according to what you saw in the .xml file.
In the main : row2 schema, under NORMALIZED_FIELD, right-click root and then select Rename in the contextual menu.
9. Enter record in the box and then click OK to save.
10. Right-click record and then click Create Sub-Element.
11. Enter "WORDS" (without quotes) in the box and then click OK to save.
12. Repeat the same procedure to create a sub-element in record, naming it UNMATCHED.
13. Repeat the same procedure to create a sub-element in UNMATCHED, naming it UNDEFINED.



14. Right-click WORDS and select As loop element in the contextual menu. Click OK to save.
You should have the following configuration:

15. Select WORDS (under NORMALIZED_FIELD:record) and drag it to lookup : row3, into the Expr. key cell corresponding to word.
16. The dictionary and the tweets will be joined on lowercase words.
Click this new expression key and then click the ellipsis (...) to edit it.
17. In the Expression box, enter or copy and paste the following expression:
StringHandling.DOWNCASE([row2.NORMALIZED_FIELD:/record/WORDS])
StringHandling.DOWNCASE converts the word to lower case (the equivalent of toLowerCase() in plain Java), so the join is case-insensitive.

18. Click OK. You should have the following join between tweets and dictionary:

19. For each tweet, you will now collect information from both the main and the lookup schemas.
Create a new output for your second tXMLMap component and name it data.
Configure the data schema as follows:

Note: Remember that you can drag columns from the main or lookup tables to the output schema.
You can also edit the expressions and use predefined functions such as getTrend.
Finally, you can add new columns using the green plus sign under the table of interest, in the Schema editor.
20. In the Schema editor, change the Type of trends from String to Integer.
Next, you will aggregate your tweets and then load them into HDFS.



Aggregate and load tweets into HDFS
1. Add a tAggregateRow component to the right of the second tXMLMap component and then connect it using the data
row.
2. Double-click to open the Component view, click the Sync columns button and configure your component as follows:

The tweets will be grouped by id, screen name, text, lang, time zone, and total influence. For each group, the average
trend is computed and the first hashtag is returned.
3. Click Edit Schema.
4. In the tAggregateRow_1 (Output) table, change Type of trends to Float.
5. Add a tHDFSOutput component in your design and connect it using a Main row.
Your job should be complete and look like the following:

6. Configure tHDFSOutput_1 as follows:

7. Click Sync Columns.


8. Click (...) to edit the schema.
9. Click the floppy disk icon to save the schema.
10. Name the schema tweetsSchema.
Once created, the schema will appear in the Repository under Metadata/Generic Schemas.

Context variable creation


In the current job, the queried hashtag is defined as a context variable but this variable has not been created yet.
1. Next to the Component and Run view, click the Context view to open it. This is where you can create context variables.
2. Below the Variables tab, click the green plus sign to create a new variable.
3. Click in the Name column and enter "HASHTAG_QUERY" (without quotes).



4. Click in the Value column to set the default value associated with your context variable and then enter "#talend" (without
quotes):

The job is now complete and will query the Twitter API for #talend.
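Note that generated Java code and component expressions refer to context variables as context.<name>. For example, the Twitter query configured in the tRESTClient component earlier in this Job reads the hashtag value through an expression along the lines of the following (a sketch only; the exact expression in your component may differ):
context.HASHTAG_QUERY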
Now, you can run your job and check the results using Hue.

Run job and check results


In the Run view, click the Run button. Your job should succeed without any errors.
Once your job has executed, you can check your results using Hue.
Open Hue in your browser, click the File Browser icon, and then browse to "/user/student/sentiment/tweets". You can click your file
to see its content. The content should be similar to the following:

The next step is to compute the trends from the tweets information.

Process tweets with MapReduce

Overview
Once loaded into HDFS, the tweets can be processed. An efficient strategy to process a large amount of data is to use the MapReduce framework.
In this part of the lab, you will see how to create a job that will automatically generate MapReduce code to be run on your Hadoop
cluster.
At the end of this part, your job will look like the following:

First, you will have to create a new MapReduce job.

Collect Tweets and Time zone data from HDFS


1. In the Repository, under Job Designs, right-click Big Data Batch and then click Create Big Data Batch Job.
2. In the Name box, enter "Tweets_analysis" (without quotes).
3. In the Framework list, select MapReduce.
4. Click Finish to create the job.
5. For Big Data Batch Jobs, the configuration of the connection to your Hadoop cluster is done in the Run view.
In the Run view, click Hadoop Configuration.
The configuration is the same as what was done previously and should be as follows:



6. In the Designer view, add a tHDFSInput component.
This component will import the tweets from HDFS.
7. Double-click to open the Component view.
8. Set the Property Type to Repository and use the HDFS_connection metadata.
9. Set the Schema to Repository and use the tweetsSchema generic schema metadata.
The schema should be as follows:

10. In the Folder/File box, enter "/user/student/sentiment/tweets" (including the quotes), or use the (...) button to browse the
file in HDFS.
11. In the Header box, enter "1" (without quotes).
12. Add another tHDFSInput component. This one will import the time zone lookup table.
13. Double-click to open the Component view.

14. Click Edit schema and configure the schema as follows:

Remember that you can use the green plus sign below the table to create a new column.
15. In the Folder/File box, enter
"/user/student/sentiment/lookup/timezone" (including the quotes).
16. In the Header box, enter "1" (without quotes).

These two components import the tweets and the time zone lookup table. The data will then be combined and aggregated to obtain the
trends.

Data mapping
First, you will combine the tweets and the time zone table using a tMap component.
1. Add a tMap component in the Designer view and then connect and label the components as follows:

2. Double-click tMap_1 to open the Map editor.


3. Tweets and time zones will be joined on the time_zone key (a plain-Java sketch of this lookup join follows these steps).
Select time_zone in the tweets schema and then drag it to the timezone schema, onto the corresponding time_zone Expr. key.



4. In the timezone schema, open tMap settings and configure the join operation as follows:

5. Click the green plus sign to create a new output to the tMap component and then name it out.
6. In the tweets table, select screen_name, text, total_influence, trends and hashtag and then drag them to out
schema.
7. In the timezone table, select country and then drag it to the out schema.
8. Add a new column to out using the green plus sign below the out table in the Schema editor, and name it date.
9. Edit the expression of date and then enter:
TalendDate.getDate("CCYY-MM-DD")
This returns the current date as a String formatted with the given pattern, so each output row is stamped with the processing date.
10. Configure the out schema as follows:

11. Click OK to save the mapping.
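If it helps to picture what the lookup join you just configured does, here is a rough plain-Java equivalent (an illustrative sketch only, not the code Talend generates; the sample values are made up, and whether unmatched tweets are kept or rejected depends on the tMap join settings configured above):

import java.util.HashMap;
import java.util.Map;

public class TimezoneJoinSketch {
    public static void main(String[] args) {
        // The timezone lookup flow behaves like an in-memory map: time_zone -> country.
        Map<String, String> timezoneToCountry = new HashMap<>();
        timezoneToCountry.put("Paris", "France");
        timezoneToCountry.put("Eastern Time (US & Canada)", "United States");

        // Each tweet row is matched on its time_zone value.
        String[] tweetTimeZones = {"Paris", "Eastern Time (US & Canada)", "Hawaii"};
        for (String tz : tweetTimeZones) {
            String country = timezoneToCountry.get(tz); // null when there is no match
            System.out.println(tz + " -> " + country);
        }
    }
}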


You have now combined the tweets with time zone information, so you can aggregate your data to compute the tweets' trends.

Aggregate data
1. Add a tAggregateRow component to the right of tMap_1 and then connect it with the out output of tMap_1.
2. Double-click to open the Component view.

3. Click Edit Schema and configure it as follows:

4. The data will be grouped by hashtag, date, and country, with a sum operation on total_influence and an average
operation on trends (a plain-Java sketch of this grouping is shown after this step).
Your configuration should look like the following:

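To picture what this tAggregateRow configuration computes, here is a rough plain-Java equivalent (an illustrative sketch only, not the MapReduce code Talend generates; the sample rows and their values are made up):

import java.util.LinkedHashMap;
import java.util.Map;

public class TrendAggregationSketch {
    public static void main(String[] args) {
        // Sample input rows: hashtag, date, country, total_influence, trends (made-up values).
        Object[][] rows = {
            {"#talend", "2015-10-01", "France", 1200, 2.0},
            {"#talend", "2015-10-01", "France", 800, 4.0},
            {"#talend", "2015-10-01", "United States", 5000, 3.0}};

        Map<String, long[]> influence = new LinkedHashMap<>(); // key -> {influence sum, row count}
        Map<String, Double> trendSum = new LinkedHashMap<>();
        for (Object[] r : rows) {
            String key = r[0] + ";" + r[1] + ";" + r[2];       // group by hashtag, date, country
            long[] acc = influence.computeIfAbsent(key, k -> new long[2]);
            acc[0] += (int) r[3];                              // sum(total_influence)
            acc[1]++;
            trendSum.merge(key, (double) r[4], Double::sum);
        }
        for (Map.Entry<String, long[]> e : influence.entrySet()) {
            double avgTrend = trendSum.get(e.getKey()) / e.getValue()[1]; // avg(trends)
            System.out.println(e.getKey() + ";" + e.getValue()[0] + ";" + avgTrend);
        }
    }
}

The result is one row per hashtag, date, and country, carrying the summed influence and the averaged trend.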
Now, you can load your results into HDFS.

Load results into HDFS


1. Add a tHDFSOutput component to the right of tAggregateRow_1 and then connect it with a Main row.
2. In the Component view, click Sync columns and then, click Edit Schema to check the columns.
3. Use the HDFS_connection metadata.
4. In the Folder box, enter "/user/student/sentiment/tweets_analysis" (including quotes).
5. In the Action list, select Overwrite to avoid error messages if the file already exists.
6. Check the Include Header option.
Your configuration should be as follows:



7. Run your job and start Hue in your browser to check the results.

Results visualization
To visualize your results on a map, you will run a pre-built Job. The visualization is done using the Google Charts API.
From the JobDesigns.zip archive file in the C:\StudentFiles folder, import the TweetsVisualization Job, and then run it. It will generate an HTML file at "C:/StudentFiles/Sentiment/tweets_trend.html".
Browse to find the file and open it in a web browser. You should see a chart similar to this one:

Note: Your results may differ because Twitter is queried live.
Now it's time to reinforce your knowledge with an exercise.



Challenge

Overview
Complete this exercise to further explore MapReduce jobs. See Solutions for a possible solution to this exercise.

Exercise
In this exercise, you will modify your jobs to explore the influence and trend of another hashtag, #BigData, in France and in the United
States.
1. Duplicate the ImportTweets job and name the new job ImportTweets_exercise.
2. Change the hashtag to #BigData.
3. Configure the tRESTClient component so that the "receive timeout" value is 120.
4. Configure the tHDFSOutput component to save the tweets in "/user/student/sentiment/tweets_bigdata"
5. Run your job and check results in Hue.
6. Duplicate the Tweets_analysis job and name the new job Tweets_analysis_exercise.
7. Configure TWEETS to read the new tweets.
8. Filter your data to keep only France and United States tweets.
9. Aggregate your data to compute the count of total_influence and the average of trends for each country.

Once you are done with the Challenge, it is time to Wrap up.

Solutions

Overview
These are possible solutions to the Challenge. Note that your solutions may differ and still be valid.

Solution
1. Duplicate the ImportTweets job and name the new job ImportTweets_exercise.

- Right-click ImportTweets and then click Duplicate Job.


- In the Name box, enter "ImportTweets_exercise" (without quotes)

2. Change the hashtag to #BigData.

- Click the Contexts tab and then click Values as table.


- Click #talend and change it to #BigData.

3. Configure the tRESTClient component so that "receive timeout" value is 120.

- Double-click the tRESTClient component to open the Component view.


- Click Advanced settings and then change the Receive timeout value to 120.

4. Configure the tHDFSOutput component to save the tweets in "/user/student/sentiment/tweets_bigdata"

- Double-click the tHDFSOutput component to open the Component view.


- In the File Name box, enter "/user/student/sentiment/tweets_bigdata"

5. Run your job and check results in Hue.

- In the Run view, click Run.


- After job execution is complete, open your file browser in Hue to find tweets_bigdata.
- You should have something similar to this:

6. Duplicate the Tweets_analysis job and name the new job Tweets_analysis_exercise.

- Right-click Tweets_analysis and then click Duplicate Job.


- In the Name box, enter "Tweets_analysis_exercise" (without quotes)



7. Configure TWEETS to read the new tweets.

- Double-click TWEETS to open the Component view.


- In the Folder/File box, enter "/user/student/sentiment/tweets_bigdata"

8. Filter your data to keep only France and United States tweets.

- Add a tFilterRow component to the right of tMap_1 and then connect it with the out output of tMap_1.
- Double-click tFilterRow_1 to open the Component view.
- Click the green plus sign to add two lines in the Conditions table and then configure as follows:

9. Aggregate your data to compute the count of total_influence and the average of trends for each country.

- Add a tAggregateRow component to the right of tFilterRow_1 and then connect it with the Filter row.
- Double-click tAggregateRow_1 to open the Component view.
- Click Edit Schema and then click the green plus sign below the Output table to create three output columns: country (String), influence (Integer), and trend (Float).
- Click OK to save the schema.
- Click the green plus sign below the Group by table to add a line in the table.
- Set the Output column to country and then set the Input column position to country as well.
- Click the green plus sign below the Operations table to add two lines to the table and then, configure it as follows:

- Add a tLogRow component to visualize the output of your processing.


- Run your job and you should have the following results in your console:

Wrap-Up

Recap
In this lab use case, you queried the Twitter API with a tRESTClient component, extracted useful information with tXMLMap
components, and finally saved it into HDFS.
You then enriched your tweet data with time zone information to be able to compute the influence and trends for each country in a MapReduce job.

Further Reading
For more information about topics covered in this lab, see the Talend Platform for Big Data User Guide.

