
Cloudera DataFlow: Flow Management with Apache NiFi: Hands-On Exercises

Table of Contents
General Notes
Hands-On Exercise: Using Your Exercise Environment
Demonstration: NiFi User Interface
Hands-On Exercise: Build Your First Dataflow
Hands-On Exercise: Start Building a Dataflow Using Processors
Hands-On Exercise: Connect Processors in a Dataflow
Hands-On Exercise: Build a More Complex Dataflow
Hands-On Exercise: Creating a Fork Using Relationships
Hands-On Exercise: Set Back Pressure Thresholds
Hands-On Exercise: Simplify Dataflows Using Process Groups
Hands-On Exercise: Using Data Provenance
Hands-On Exercise: Creating, Using, and Managing Templates
Hands-On Exercise: Versioning Flows Using NiFi Registry
Hands-On Exercise: Working with FlowFile Attributes
Hands-On Exercise: Using the NiFi Expression Language
Hands-On Exercise: Building an Optimized Dataflow
Hands-On Exercise: Building Site-to-Site Dataflows
Hands-On Exercise: Monitoring and Reporting
Hands-On Exercise: Adding Apache Hive Controller
Hands-On Exercise: Integrating Dataflows with Kafka and HDFS

© Copyright 2010–2021 Cloudera. All Rights Reserved.


Not to be reproduced or shared without prior written consent from Cloudera.

General Notes

Exercise Environment Overview


This course provides an exercise environment running the services necessary to
complete the exercises.

Course Exercise Directories


The main course directory is ~/training_materials/nifi/. Within that directory
you will find the following subdirectories:

• exercises—contains exercise solution templates.

• data—contains the data files used in all the exercises.

• scripts—contains the course setup script and other scripts required to complete
the exercises.

If you have difficulty typing the ~ symbol, use /home/training/ instead. For example,
the main course directory is /home/training/training_materials/nifi/.
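For example, either of the following equivalent commands changes to the main course
directory:

$ cd ~/training_materials/nifi/
$ cd /home/training/training_materials/nifi/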

Working with the Linux Command Line

• In some steps in the exercises, you will see instructions to enter commands like this:

$ hdfs dfs -put mydata.csv \
    /user/training/example

The dollar sign ($) at the beginning of each line indicates the Linux shell prompt. The
actual prompt will include additional information such as user name, host name, and
current directory (for example, [training@localhost ~]$) but this is omitted
from these instructions for brevity.
The backslash (\) at the end of a line signifies that the command is not complete
and continues on the next line. You can enter the code exactly as shown (on multiple
lines), or you can enter it on a single line. If you do the latter, you should not type in
the backslash.
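For example, the command shown above could also be entered on a single line:

$ hdfs dfs -put mydata.csv /user/training/example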


Viewing and Editing Exercise Files


• Command-line editors
Some students are comfortable using Linux text editors like vi or nano. These can be
run on the Linux command line to view and edit files as instructed in the exercises.

• Graphical editors
If you prefer a graphical text editor, you can use Pluma. You can start Pluma using an
icon from the remote desktop toolbar. (You can also use Emacs if you prefer.)

Points to Note during the Exercises


Step-by-Step Instructions
As the exercises progress and you gain more familiarity with the tools and
environment, we provide fewer step-by-step instructions; as in the real world, the
instructions will merely give you a requirement and it is up to you to solve the problem!
If you need help, refer to the solutions provided, ask your instructor for assistance, or
consult with your fellow students.

Catching Up
Many of the exercises in this course build on dataflows you created in previous
exercises. If you are unable to complete an exercise, you can use one of the provided
solution dataflows in the area of the NiFi canvas labeled Solutions. When a dataflow
from a previous exercise is needed, the necessary solution dataflow is indicated
toward the beginning of the exercise. Some solution dataflows are not prerequisites
for subsequent exercises, but are there simply to show the solution for a given exercise
if needed. The exercise "Optimizing a Dataflow" is an example. Not all exercises have
solution dataflows, for example, "Monitoring and Reporting".
The solution dataflow for the peer side of "Building Site-to-Site Dataflows" and
the solution dataflows for "Integrating Dataflows with Kafka and HDFS" are not on
the canvas initially. These are templates that must be imported and instantiated
individually if you need them (they are not prerequisites for any other exercises).
The template files for these dataflows are site-to-site-dataflow-solution-peer.xml and
integrate-dataflow-solution.xml, located in
~/training_materials/nifi/exercises, as are all other individual dataflow
solutions. You learn to import and instantiate templates in this course.


Hands-On Exercise: Using Your Exercise Environment


In this exercise, you will connect to your exercise environment.
Be sure to read the General Notes section above before starting the exercises.

Connecting to Your Exercise Environment


Connecting to Your Remote Host

1. In your local browser, open the URL provided by your instructor to view the
exercise environment portal.

2. The environment portal page displays thumbnail images for your exercise
environment hosts. The hosts should be started already (indicated by a green
background on the thumbnail image), but might be suspended or powered off,
indicated by a gray or blue background.
There are two hosts: Master and Peer.

a. If the Peer host is running, shut it down by clicking the “Power options for this
VM” icon above the thumbnail (indicated by a power button symbol).


b. Select Shut down.

Note: The Peer host is not used until the exercise in which a site-to-site
dataflow is built later in the course. It saves resources to keep it turned off until
it is needed.

c. If the Master machine is suspended or powered off, start the machine by
clicking the play button (triangle icon). It will take several minutes to start.

3. Click the host thumbnail of the Master host to open a new window showing the
remote host machine.
The Master remote host desktop will display. The exercises refer to this as the
“remote desktop” to distinguish it from your own local machine’s desktop.
All exercises in the course are performed on the remote desktop of the Master host
except for the site-to-site exercise, which is completed using both the Master and
Peer host systems.

Verifying Your Cluster Services


When you start or restart your exercise environment hosts, the cluster services on the
host will automatically start. It can take up to 15 minutes for all services to start up
fully.


4. Open a new terminal window using the remote host desktop shortcut.

5. Run the service status verification script:

$ check-health.sh

6. Confirm that all required services are noted as “good.” Required services: HDFS,
Hive, Kafka, NiFi, NiFi Registry, and ZooKeeper.

7. If any of the required services are “bad”, wait several minutes and then try again. (It
can take up to 15 minutes for all services to start fully.) If they are still not running
correctly, try restarting all the services by running the following script:

$ start-cluster.sh

Wait until the script completes and then check the health again.

Your exercise environment is now ready.

Stopping Your Remote Host


Your remote host will be shut down at the end of class, but it will continue to be
accessible to you for a limited time after that.
In order to minimize your usage of the limited time available, your remote host will be
automatically shut down after thirty minutes of idle time. You can also stop it manually.
To shut down your environment, click the stop button (a square) on the client browser
toolbar at the top of the remote desktop display.

Follow the instructions above in the Connecting to Your Remote Host section to restart
your environment.

Using the Remote Desktop


The Browser Client Toolbar
Use the browser client toolbar at the top of the remote desktop display to control the
remote virtual machine host.


The list below describes some important toolbar functions (each is represented by an
icon on the toolbar):

• Return to the environment portal
• Shut down the remote host
• Select keyboard language
• Copy text between the remote desktop’s clipboard and your local machine’s clipboard
• Resize the desktop to the size of your local browser window
• Increase or decrease the remote desktop’s resolution

Using the Remote Desktop


The remote desktop uses the MATE window manager. The MATE toolbar Applications
menu gives you access to a variety of applications on the remote virtual machine. It also
provides shortcuts to start Firefox, a terminal window, or the Pluma editor.
You can start a file browser using a desktop shortcut.

Note: Depending on the size of your browser window, you might need to hide the
browser client toolbar to be able to see the full desktop toolbar.


Optional: Downloading and Viewing the Exercise Manual on Your Remote Desktop

In order to be able to copy and paste from the Exercise Manual (the document you are
currently viewing) to your remote desktop, you need to view the document on your
remote machine rather than on your local machine.

1. Download the Exercise Manual

a. Go to https://university.cloudera.com/user/learning/enrollments and
log in to your account. Then from the dashboard, find this course in the list of
your current courses.

b. Select the course title, then click to download the Exercise Manual under
Materials.
This will save the Exercise Manual PDF file in the Downloads folder in the
training user’s home directory on the remote host.

2. View the Exercise Manual on Your Remote Host

a. Open a terminal window using the shortcut on the remote desktop, and then
start the Atril PDF viewer:

$ atril &

b. In Atril, select menu item File > Open and open the Exercise Manual PDF file in
the Downloads directory.

This is the end of the exercise.


Demonstration: NiFi User Interface


In this demo, you will learn about the NiFi user interface.
You will explore navigating the NiFi canvas, adding components to the canvas, and
using the global menu.

Demo Instructions
Explore Canvas Navigation

1. If you have not already done so, start the Firefox browser on your remote desktop.

2. You will see browser bookmarks for NiFi and NiFi Registry.

Click on the NiFi bookmark to see the NiFi canvas.


The URL is http://master.example.com:8080/nifi.

3. In the “bird’s-eye view” area of the Navigate palette, drag the rectangle
representing the view left and right to pan the canvas. Also try scrolling by dragging
the canvas itself.


4. Try dragging the panning rectangle up and to the left, so that it extends out of the
bird’s-eye view of the canvas. When you release the rectangle, the available canvas
will grow.
You can do this whenever your canvas is full and you need a clear area to work on.

5. Use the + and - buttons on the Navigate palette to zoom in and out. Also try
zooming by using your mouse’s scrolling function.


6. Use the Navigate palette “fit” and “actual” buttons to adjust the canvas view.

Explore Flow Components and the Operate Palette

7. Select a processor component from the toolbar at the top of the canvas and drag it
onto a clear area of the canvas.

8. The Add Processor dialog window will be displayed showing a list of available
processor types.
In the search box in the upper right, enter the text lists3. This will filter the list to
locate the ListS3 processor.

Note: The ListS3 processor retrieves a list of files from an Amazon Web Services
S3 bucket. However, this demo will not show the usage of the processor, only how
to create and manipulate a processor on the canvas.


9. Select ListS3 in the list of processors and click the ADD button. The ListS3
processor will appear on the canvas.

10. Click the component to select it, then try dragging it to a different location on
the canvas. Moving processors allows you to keep your canvas and dataflows
organized.

11. Locate the Operate palette (below the Navigate palette). Selecting a component
on the canvas activates buttons to operate on that component. Note that the palette
displays the name of the currently selected component.
While the ListS3 component is selected, click the “Copy” button to copy the
selected component to the clipboard.

Note: You can also copy a component by right-clicking on the component to bring
up the context menu and selecting Copy.

12. Click on the canvas to unselect the ListS3 component, then click the “Paste”
button on the Operate palette. A copy of the first component will be pasted near the
existing one.


Note: You can also paste a component by right-clicking on the canvas to bring up
the context menu and selecting Paste. (The Paste option will only show if you have
copied a component to the clipboard.)

13. Select the two components you created. You can select multiple components in one
of two ways:

• Hold down the SHIFT key while you click on the components you want to select.

• Hold down the SHIFT key while you drag your pointer on the canvas and draw a
rectangle around the components you would like to select.

Using one of these methods, select the two ListS3 processor components you
added earlier.

14. Note that the Operate panel now says Multiple components selected. Selecting
multiple components activates the “Group” button, which will move your
components into a process group. Click the “Group” button.


15. The “Group” button will prompt you for a group name. Enter Demo Group and
click ADD.
The two components you selected will be replaced on the canvas by a process
group.

16. Double-click on the Demo Group process group to display the process group’s
canvas. The canvas should include the two processors you added to the group.

17. Return to the root canvas by clicking the NiFi Flow link in the breadcrumb trail in
the lower left.

Explore Global Controls

18. Open the Global Menu, indicated by three horizontal lines in the upper right corner.
(This is also sometimes called the “hamburger menu” because of the appearance of
the icon.)

19. The options on the Global Menu let you view and configure details about all your
flows and components, the state of the cluster, FlowFile data, and so on.


Select the Flow Configuration History item for an example of a global function.
This will display a list of various changes that have been made on the canvas,
including ones you made above.

Close the history window by clicking the X in the upper right corner.

20. One of the most important options for this class is being able to review the NiFi
documentation from within the UI.
Select the Help item on the Global Menu.
Explore the documentation by clicking on a few different subjects on the left and
viewing the documentation on the subjects on the right. Feel free to refer to this
documentation throughout the remainder of the course.
When you are done exploring the documentation, close the documentation window by
clicking the X in the upper right corner.

21. Click the search button on the right side of the toolbar (indicated by a magnifying
glass) to open a search box. Enter the search string lists3. This will display a list
of all the ListS3 components on the canvas.

If you select one of the search results, it will select and center the corresponding
component.


22. You will not need the processors and process group you created in this demo in
later exercises. You can delete them now so that they do not clutter your canvas.
Select the Demo Group process group on the root canvas and click DELETE in the
Operate panel.


Hands-On Exercise: Build Your First Dataflow


In this exercise, you will create a simple dataflow by generating a FlowFile and
logging the FlowFile attributes. The exercise is not intended to describe the
details of how the dataflow works at this point.

Create a Dataflow to Generate a FlowFile and Log FlowFile Attributes
1. Configure a GenerateFlowFile processor.
Note: Before you begin building dataflows, you might want to move the solution
dataflows that are provided for you on the canvas out of your way to make space
for your own dataflows. You can do this using one of the techniques already
demonstrated, for example, by using the Navigate palette or clicking in a blank area
of the canvas and dragging the entire canvas in a desired direction.

a. In the NiFi UI, drag a processor onto the canvas and filter with generateflow.

b. Click ADD.

c. Right-click on the GenerateFlowFile processor and select Configure. (You
can also double-click on the processor).


d. On the SETTINGS tab, name the processor First GenerateFlowFile.

e. Click APPLY.

2. Configure a LogAttribute processor.

a. Drag a processor to the canvas and filter with logattribute to add it.

b. Right-click and select Configure or double-click on the processor.

c. On the SETTINGS tab, name the processor First LogAttribute.


d. Check the box next to success for Automatically Terminate Relationships.


Your SETTINGS tab should look like this:

e. Click APPLY.

3. Connect the GenerateFlowFile and LogAttribute processors.

a. Hover over the GenerateFlowFile processor with your mouse and a
connection symbol will appear. Click on this symbol and drag it on top of
the LogAttribute processor. A dashed line will appear and turn green when
the connection is properly positioned.

b. You will see a Create Connection screen. Ensure that success is checked on the
DETAILS tab.


c. Click ADD.
Your dataflow should look like this:

Note: Do not start this dataflow. You will have opportunities to run dataflows in
subsequent exercises.

This is the end of the exercise.


Hands-On Exercise: Start Building a Dataflow Using Processors
In this exercise, you will begin creating a dataflow by adding processors only.
Later, connections will be added to process data and write files to disk.
The processors added in this exercise, and then connected in the next exercise, will
implement the following scenario:

• Collect the output of an application log file
• Split the contents into multiple files
• Save the files in a destination directory

Begin Dataflow Creation by Adding Processors


1. Configure a TailFile processor.
This processor is used in this exercise to create FlowFiles by reading data from a log
file.

a. In the NiFi UI, drag a processor onto the canvas and filter with TailFile.

b. Click ADD.

c. Right-click on the TailFile processor and select Configure. (You can also
double-click on the processor).


d. On the SETTINGS tab, name the processor Put app log TailFile.

e. On the SCHEDULING tab, set Run Schedule to 10 seconds.

f. On the PROPERTIES tab, enter /var/log/nifi/nifi-app.log for the
File(s) to Tail property, and nifi-app_* for the Rolling Filename Pattern
property.
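If you are curious about the data the processor will read, you can optionally peek at
the end of the log file in a terminal window:

$ tail -n 5 /var/log/nifi/nifi-app.log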


g. Click APPLY.

2. Configure a SplitText processor.


This processor divides textual FlowFiles into smaller files.

a. Drag a processor to the canvas and filter with splittext to add it.

b. Right-click and select Configure or double-click on the processor.

c. On the SETTINGS tab, name the processor Put app log SplitText.

d. Check the boxes next to failure and original for Automatically Terminate
Relationships.
Your SETTINGS tab should look like this:

e. On the PROPERTIES tab, set the Line Split Count to 15.


f. Click APPLY.

3. Configure a PutFile processor in order to save the generated FlowFiles to a local
directory.

a. Add a PutFile processor to the canvas and double-click or right-click to
configure it.

b. On the SETTINGS tab, name it Put app log PutFile.

c. Check the boxes next to failure and success for Automatically Terminate
Relationships. This is because the FlowFiles will be written to disk at this point
and are not needed subsequently.
Your SETTINGS tab should look like this:


d. On the PROPERTIES tab, set the Directory to /tmp/nifi/putfile, set
the Conflict Resolution Strategy to replace, and ensure that Create Missing
Directories is set to true.
The PROPERTIES tab should look like this:

e. Click APPLY.
The three processors on your canvas should look something like this:

This is the end of the exercise.


Hands-On Exercise: Connect Processors in a Dataflow


Files Used in This Exercise:
Log file /var/log/nifi/nifi-app.log

In this exercise, you will complete the dataflow by adding connections. This will allow
you to use a log file as the data source, process the data, and write the resulting files
to a directory on the local system.
Important: This exercise depends on completion of Hands-On Exercise: Start Building
a Dataflow Using Processors. If you did not complete that exercise, use the Start
Dataflow Processors Solution process group (a grouping of components)
provided for you on the Solutions area of your canvas prior to beginning this
exercise, as follows:

1. Double-click on the Start Dataflow Processors Solution process group
to see the processors inside it.

2. Select all processors by holding down the shift key and clicking each one, or holding
down the shift key and dragging your pointer around all components.

3. Right-click and select Copy.

4. On the lower left of your screen, click NiFi Flow to return to the root canvas.

5. Use the Navigate palette or click on the canvas and drag your pointer to the right to
create an empty space for the components.

6. Right-click and select Paste to paste the processors onto the canvas.

7. Move the processors to a convenient spot on the canvas to work with them in this
exercise.

Complete Building a Dataflow to Process Data and Write Files


1. Connect the TailFile and SplitText processors.

a. Hover over the TailFile processor with your pointer and a connection
symbol will appear. Click on this symbol and drag it on top of the

SplitText processor. A dashed line will appear and turn green when the
connection is properly positioned.

b. You will see a Create Connection screen. Ensure that success is checked on the
DETAILS tab.

c. Click ADD.

2. Connect the SplitText and PutFile processors.

a. Drag a connection from the SplitText processor to the PutFile processor.

b. On the DETAILS tab, check the box next to splits. You are only connecting
FlowFiles that have been split at this point. The failure and original
relationships were terminated in the SplitText processor.


c. Click ADD.
Your dataflow should look something like this:

3. Start the dataflow and observe files written to disk.

a. While holding the shift key, select the processors in the dataflow. (You can also
hold down the shift key and drag an outline around all components).

b. On the Operate palette, click the “Start” button (shown as an arrow icon) to start
the dataflow.


After a few seconds, you should see statistics on the surface of the processors,
indicating that data is moving through your dataflow.

c. In a terminal window, use the ls -l command to list the files in the
/tmp/nifi/putfile directory.

$ ls -l /tmp/nifi/putfile

You should see one or more files. The filenames are of the form
nifi-app.xxx-yyy.log. The numbers xxx and yyy indicate the byte range
from the input file (nifi-app.log) that is in the saved file.

d. After observing the dataflow operate, noting some statistics on the
processors, and observing files written, stop the dataflow: ensure that all
dataflow processors are selected, then click the “Stop” button (shown as a
square stop icon) on the Operate palette.

This is the end of the exercise.


Hands-On Exercise: Build a More Complex Dataflow


Files Used in This Exercise:
Log file /var/log/nifi/nifi-app.log

In this exercise, you will build a more complex dataflow by adding additional
processors and connections to the dataflow previously created.
You will add CompressContent and UpdateAttribute processors to the dataflow.
You will then start the dataflow and observe as the dataflow writes files to the disk.
The dataflow will implement the following scenario:

• Collect the output of an application log file
• Split the contents into multiple files
• Compress the files
• Rename the files to include a timestamp in the name
• Save the files in a destination directory
Important: This exercise depends on completion of Hands-On Exercise: Connect
Processors in a Dataflow. If you did not complete that exercise, use the Complete
Dataflow Connections Solution process group provided for you on the
Solutions area of your canvas prior to beginning this exercise, as follows:

1. Select all dataflow components from inside this process group and copy them.

2. Return to the root canvas, create an empty space for the dataflow, and paste it onto
the canvas.

3. As needed, move the dataflow to a convenient spot on the canvas to work with in
this exercise.

Adding and Connecting CompressContent and UpdateAttribute Processors
1. Add a CompressContent processor to the canvas.

2. Configure the new processor.

a. Under SETTINGS, name the processor Put app log CompressContent.


b. Select failure for Automatically Terminate Relationships.

c. Under PROPERTIES, set the Compression format to gzip.

3. Move the connection that is currently between the SplitText and PutFile
processors to be between the SplitText and CompressContent processors.

a. Click on the connection. The ends of the arrow will be highlighted red and
blue.

b. Drag the blue end (the end by the point of the arrow) from the PutFile
processor to the CompressContent processor.


4. Drag the PutFile processor down on the canvas to make room for the next
processor.
In the dataflow so far, you have collected the log data, split it as needed, and
compressed the resulting file. You now need to rename the new files to include a
prefix with the current time.

5. Add an UpdateAttribute processor to the canvas in the spot where the
PutFile processor was.

6. Configure the processor.

a. Under SETTINGS, name the processor Put app log UpdateAttribute.

b. In the PROPERTIES tab, click the + symbol and add a new property called
filename.


c. Set the filename property value to
nifi-applog.${now():format('HH:mm:ss')}.gz. This is a NiFi
Expression Language expression that replaces the original filename with a
nifi-applog prefix, a timestamp in hour:minute:second format, and a .gz
file extension.
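For example, a FlowFile processed at 14:05:09 would be renamed to
nifi-applog.14:05:09.gz (the timestamps in your filenames will differ).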

Note: User-defined properties and the NiFi Expression Language are covered in
depth later.

7. Connect the CompressContent processor to the new UpdateAttribute
processor for the success relationship.

8. Connect the UpdateAttribute and PutFile processors for the success
relationship.
This will save the processed files in the /tmp/nifi/putfile directory as
previously configured in the PutFile processor. You should now have five
processors and your dataflow should look something like this:


9. Run the dataflow.


All the processors are currently in a stopped state. Start them now using one of the
following methods:

• Right-click each component individually and select Start.


• Select all the processors by holding down the Shift key and dragging your
pointer. Then click the “Start” button on the Operate palette to start all the
components.

After a few moments, you will see that the statistics are changing as data flows
through the processors.

10. Confirm that the output files are saved correctly.


In a terminal window, list the contents of the /tmp/nifi/putfile/ directory.

$ ls /tmp/nifi/putfile

You should see saved files, correctly compressed and named.


Repeat the command and note that more files are added as the flow continues to
process the application log output.
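If you want to confirm that the contents really are gzip-compressed (this is not required
for the exercise), one option is to decompress a few of the files and view the first lines:

$ zcat /tmp/nifi/putfile/nifi-applog.*.gz | head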

Note that the number of files saved in the /tmp/nifi/putfile
directory may be less than the number of files generated by the
SplitText processor. This is because NiFi processes FlowFiles
very fast and multiple files may be processed in less than a second.
Because the UpdateAttribute processor names files with a
timestamp with a one-second granularity, files generated within
the same one-second span of time will all have the same name.
When the PutFile processor saves a file, it overwrites any
prior files with the same name—that is, the last file generated
within a given second will be the only one remaining. This
situation is acceptable in a learning exercise, but in a real-world
system, you would usually want to make sure every file is saved.
You can modify the file naming expression that you set in the
UpdateAttribute processor so that every FlowFile is saved
to disk. However, this results in a larger number of files than is
desired for purposes of this exercise.
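If you do want unique filenames, one possible approach (a sketch only, not needed for
this exercise) is to include milliseconds and the FlowFile’s uuid attribute in the
expression, for example:
nifi-applog.${now():format('HH:mm:ss.SSS')}-${uuid}.gz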

11. Stop the dataflow after you have confirmed that it is working. You can use the
same techniques to stop the dataflow that you did to start it above, using the “Stop”
button instead of the “Start” button.

You have now created a dataflow that:

• Collects application log data appended to a log file
• Splits the text into batches based on a number of lines
• Compresses each batch of data
• Saves the compressed data to files using filenames that include timestamps

This is the end of the exercise.


Hands-On Exercise: Creating a Fork Using Relationships
Files Used in This Exercise:
Log file /var/log/nifi/nifi-app.log

In this exercise, you will add additional PutFile processors to your existing
dataflow to configure different relationships.
The dataflow will implement the following scenario:

• Collect the output of an application log file
• Split the contents into multiple files
• If successfully split, route the original FlowFile to a separate destination, with the
split files continuing on in the dataflow
• If a FlowFile is not successfully split, route the original file to a different destination,
terminating further processing
• Compress the files
• Rename the files to include a timestamp in the name
• Save the files in a destination directory

Important: This exercise depends on completion of Hands-On Exercise: Build a More
Complex Dataflow. If you did not complete that exercise, use the Complex Dataflow
Solution process group provided for you on the Solutions area of your canvas.
Copy the dataflow inside this process group to an empty area of the root canvas prior to
beginning this exercise.

Adding and Connecting PutFile Processors for original and failure Relationships
1. Review the existing relationships configured in the SplitText processor.

a. Double-click the SplitText processor or right-click and select Configure.

b. Under SETTINGS, observe that the boxes next to failure and original are
checked, while splits is not.


With failure checked, files will be dropped at this point if they cannot be split,
with no further processing.
With original checked, if files are successfully split, the original files will be
dropped at this point, but split files will be sent on in the dataflow for further
processing.

2. Duplicate the Put app log dataflow.

a. Ensure all the processors in the Put app log dataflow are stopped.

b. Select all the existing dataflow components (processors and connections) by
holding down the Shift key and dragging your pointer around all components
in the dataflow. Right-click and select Copy, then right-click and select Paste.
(You can also use the copy and paste icons on the Operate palette.)

c. Change the names of each processor in the duplicated flow, replacing Put app
log with Relation at the beginning of each name.


3. In the Relation PutFile processor, under PROPERTIES, change the Directory
to /tmp/nifi/relation. Ensure the Conflict Resolution Strategy is set to
replace and Create Missing Directories is set to true.

4. Configure a PutFile processor for the original relationship.

a. Add a PutFile processor to the canvas and name it Relation PutFile
Original.

b. On the PROPERTIES tab, set the Directory to be /tmp/nifi/original.

c. Ensure the Conflict Resolution Strategy is set to replace and Create Missing
Directories is set to true.

d. On the SETTINGS tab, select failure and success for Automatically Terminate
Relationships, since there is nothing further to do with the FlowFiles after they
are saved to disk at this point.

5. Configure a PutFile processor for the failure relationship.

a. Add a PutFile processor to the canvas and name it Relation PutFile
Failure.

b. On the PROPERTIES tab, set the Directory to be /tmp/nifi/failure.


c. Ensure the Conflict Resolution Strategy is set to replace and Create Missing
Directories is set to true.

d. On the SETTINGS tab, select failure and success for Automatically Terminate
Relationships.

6. Uncheck the failure and original relationship boxes under SETTINGS in the
SplitText processor.
No boxes should be checked under Automatically Terminate Relationships at
this point. Each relationship—failure, original, and splits—will now be routed to
different PutFile processors and should not automatically be terminated.

7. Connect the SplitText and Relation PutFile Original processors,
selecting original on the DETAILS tab under For Relationships.

8. Connect the SplitText and Relation PutFile Failure processors,
selecting failure under For Relationships.


The relevant part of your dataflow should now look something like this:

9. Start all components in the Relation dataflow.

10. After about 15 seconds, stop the entire dataflow to prevent it from using up too
much space on the system.

11. Examine file output.

a. In a terminal window, list the contents of the relation directory.

$ ls -l /tmp/nifi/relation

You should see zipped, timestamped files for today’s date, having filenames
beginning with nifi-applog.
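For example, a name such as nifi-applog.10:42:07.gz (the timestamps in your
filenames will differ).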

b. List the contents of the original directory.

$ ls -l /tmp/nifi/original


You should see files having filenames beginning with nifi-app and ending
with log.

c. You should not see a failure directory with any files in it.

$ ls -l /tmp/nifi/failure

This should return No such file or directory because there has not
been a failure in splitting files yet.

12. Create a SplitText failure condition and observe the results.

a. In the Relation SplitText processor, set the Maximum Fragment Size
property to 10b and the Header Line Count property to 5.
This will cause the header size to be greater than the maximum split size,
resulting in a failure.

b. Start the dataflow.

c. Note that no files are routed to the CompressContent, UpdateAttribute,
Relation PutFile Original, or Relation PutFile processors.

d. Observe that the Relation SplitText processor shows files inbound and
outbound, and the Relation PutFile Failure processor shows files and
data inbound.

e. Hover over the bulletin section (upper right corner) of the Relation
SplitText processor and observe the errors generated.


f. Stop the dataflow.

g. List the contents of the failure directory.

$ ls -l /tmp/nifi/failure

You should see files having filenames beginning with nifi-app and ending
with log.

h. After a few minutes, the Relation SplitText processor bulletin errors,
along with the file and data statistics for the Relation SplitText and
Relation PutFile Failure processors, are cleared.

13. Return the Relation SplitText properties to their default values by removing
the Maximum Fragment Size value and setting the Header Line Count back to 0.

14. Start the dataflow and note that the Relation PutFile Failure processor
no longer receives files or data, while the Relation PutFile Original and
Relation PutFile processors do.

15. Stop the dataflow.

This is the end of the exercise.


Hands-On Exercise: Set Back Pressure Thresholds


Files Used in This Exercise:
Log file /var/log/nifi/nifi-app.log

In this exercise, you will set back pressure object and size thresholds in
connections to create conditions in which back pressure can be observed.
The Put app log dataflow will be used throughout this exercise.
Note that the canvas only refreshes by default every 30 seconds. As you observe
statistics on the connection queues and processors in this exercise, you can right-click
on the canvas and select Refresh in order to see the very latest statistics in your
dataflow.
Note also that statistics shown on processor surfaces will gradually return to zero
after 5 minutes when there is no data being processed, though the statistics on the
connection queues will remain. Some of the screen shots in this exercise show zero
statistics on processor surfaces because of this. You may observe the same thing,
depending on the pace with which you move through the exercise. This is normal NiFi
behavior.
Important: This exercise depends on completion of Hands-On Exercise: Build a More
Complex Dataflow. If you did not complete that exercise, use the Complex Dataflow
Solution process group provided for you on the Solutions area of your canvas.
Copy the dataflow inside this process group to an empty area of the root canvas prior to
beginning this exercise.

Change connection back pressure thresholds and observe the effects on FlowFile processing
1. Right-click and select Delete on the connection between the TailFile processor
and the SplitText processor. This exercise uses a GenerateFlowFile to
generate test data more easily.

2. Click and drag the SplitText processor to make space for another processor. You
will add this processor back to the dataflow at the end of the exercise.

3. Add a GenerateFlowFile processor to your canvas.

4. On the SETTINGS tab of the GenerateFlowFile processor, change the name to
Back Pressure GenerateFlowFile.


5. On the SCHEDULING tab, change the Run Schedule to 2 sec.

6. On the PROPERTIES tab, change the File Size to 1 KB.

7. Drag a connection between the GenerateFlowFile processor and the
SplitText processor for a success relationship.

8. On the SETTINGS tab of the connection between the SplitText and
CompressContent processors (named splits), set the Back Pressure Object
Threshold to 10.

9. Start the GenerateFlowFile and SplitText processors. Do not start any other
processors in the dataflow.

10. After 10-15 seconds, stop these two processors.


The splits connection should have more than 10 files queued, with the line on
the bottom left of the connection showing red because the maximum number of
objects has been exceeded for this queue.
The first connection, between the GenerateFlowFile and SplitText
processors, should have files queued. While the two processors were running,
after the splits connection exceeded its maximum object threshold, this first
connection continued to queue files.
Note that the SplitText processor stopped processing files once the splits
connection queue reached its maximum object threshold.


The statistics shown here will vary.

11. Set the Back Pressure Object Threshold of the first connection (between the
GenerateFlowFile and SplitText processors) to 10.

12. Start the GenerateFlowFile and SplitText processors. Do not start any other
processors in the dataflow.
Note that when the number of objects queued in the first connection reaches 10, the
GenerateFlowFile processor stops sending files.
Observe that the number of objects queued in the splits connection does not
change because it still exceeds the maximum object threshold.
Files still cannot be processed by the SplitText processor.

13. Stop the GenerateFlowFile and SplitText processors.

14. Set the Back Pressure Object Threshold of the first connection (between the
GenerateFlowFile and SplitText processors) to 20.

Note that the object threshold in the first connection queue is no longer exceeded.


15. Start the GenerateFlowFile and SplitText processors. Do not start any other
processors in the dataflow.

More files are now being queued in the first connection, but the splits connection
receives no more files and files still are not being processed by the SplitText
processor.

16. After the queued file count in the first connection reaches 20, stop the
GenerateFlowFile and SplitText processors.

17. Set the splits connection Back Pressure Object Threshold to 250 and the Size
Threshold to 5 KB.
Note: If the number of files shown in your splits connection queue is greater
than 250, set the Back Pressure Object Threshold to something greater than the
number of files shown in the queue.
Without starting any processors, the splits connection queue shows the same
number of files queued, but the size threshold is exceeded immediately. You may
need to refresh your canvas to see this.


18. Set the splits connection Size Threshold to 10 MB.


Note: If the amount of data shown in your splits connection queue is greater
than 10 MB, set the Size Threshold to something greater than the amount of data
shown in the queue.
Without starting processors, the splits connection queue shows the same
number of files queued, but the size threshold should no longer be exceeded.

19. Start the GenerateFlowFile and SplitText processors. Do not start any other
processors in the dataflow.
Note that more files are now queued in the splits connection and the first
connection queue goes empty. The SplitText processor also shows some data
being processed.

20. Start the remaining processors in the dataflow (CompressContent,
UpdateAttribute, and PutFile).
Statistics can be observed on all processors and the queues are empty (unless the
thresholds are again reached).


21. Stop the entire dataflow after running it for 20-30 seconds.

22. Verify that files have been written to the /tmp/nifi/putfile directory for this
dataflow.
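One way to check is to list the most recently written files in that directory:

$ ls -lt /tmp/nifi/putfile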

23. Reset the Back Pressure Object Threshold to 10000 and the Back Pressure Size
Threshold to 1 GB in the first and splits connections.

24. Delete the GenerateFlowFile processor and reconnect the TailFile
processor to the dataflow.

This is the end of the exercise.


Hands-On Exercise: Simplify Dataflows Using Process Groups
Files Used in This Exercise:
Log files /var/log/nifi/nifi-app.log
/var/log/nifi/nifi-user.log

In this exercise, you will practice using process groups with dataflows.

• Duplicate the existing flow that collects the output of an application log
• Partition both flows into separate process groups
• Create a third group to save the data using input and output ports

Important: This exercise depends on completion of Hands-On Exercise: Build a More
Complex Dataflow. If you did not complete that exercise, use the Complex Dataflow
Solution process group provided for you on the Solutions area of your canvas.
Copy the dataflow inside this process group to an empty area of the root canvas prior to
beginning this exercise.

Duplicating the Existing Dataflow


Add a dataflow to read and process the nifi-user.log file in the same way the Put
app log dataflow reads and processes the nifi-app.log file. The easiest way to do
this is to duplicate the existing Put app log dataflow.

1. Ensure all the processors in the Put app log dataflow are stopped.

2. Select all the existing dataflow components. Right-click and select Copy, then right-
click and select Paste. This will duplicate the entire flow and create a new dataflow
to run in parallel with the original one.


3. Change the names of each processor in the new flow you just created so that each
name begins with Put user log instead of Put app log.


4. Modify the TailFile processor in the new Put user log dataflow.

a. Set the File(s) to Tail property to /var/log/nifi/nifi-user.log.

b. Set the Rolling Filename Pattern property to nifi-user_*.

5. View the state of the Put app log TailFile processor by right-clicking on
the processor and selecting View State, then click Clear State. This resets the read
pointer so the processor can begin reading data again.


6. Repeat the previous step to clear the state of the Put user log TailFile
processor in the new dataflow if necessary.

7. Change the filename property of the new Put user log UpdateAttribute
processor to nifi-userlog.${now():format('HH:mm:ss')}.gz.

Now you have two parallel flows, each collecting data from a different log, processing it,
and saving it to the same output directory with different file names.

Creating Process Groups from the Existing Dataflows


The next step is to create two process groups out of these two independent dataflows.

8. Select all the components of the original Put app log dataflow, then click the
“Group” button on the Operate palette. This will prompt you for a process group
name. Enter Put app log Group.


9. Repeat the same steps for the Put user log (duplicate) dataflow. Give it the
name Put user log Group.

10. Now you have two independent process groups working in parallel, each with their
own flow. Start the two flows and confirm they are working correctly.
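One way to confirm this is to list the output directory and check that new files with both
the nifi-applog and nifi-userlog prefixes are appearing:

$ ls -lt /tmp/nifi/putfile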

11. Stop the flows in each process group before moving to the next step.


Creating a Third Process Group to Avoid Duplicate Processors


Both of the dataflows in the new process groups have an identical PutFile processor.
Create a single process group that can be reused for both dataflows to avoid this
duplication.

12. Drag a process group icon onto the canvas. When prompted, set the name to Save
app-user log Group and click ADD.

13. Double-click the Put app log Group process group to view its dataflow.

14. Right-click the PutFile processor and select Copy. (Simply clicking the processor
and entering Ctrl+C will also work.)

15. Return to the root canvas by clicking NiFi Flow in the navigation breadcrumb trail
below the canvas.


16. Double-click to open the Save app-user log Group process group. Paste the
PutFile processor you copied above.

17. Now you have a PutFile processor inside the Save app-user log Group
process group. The next step is to add an input port to the process group to receive
data.

a. Drag the input port icon onto the canvas and name the port Put app-user
log InputPort.

b. Connect Put app-user log InputPort to the PutFile processor by
dragging the Put app-user log InputPort connection icon.
Your Save app-user log Group process group should look like this:

18. Open the Put app log Group process group again and delete the connection
between the UpdateAttribute and PutFile processors, then delete the
PutFile processor.


19. Add an output port to the canvas and name it Put app log OutputPort.

20. Create a connection from the UpdateAttribute processor to Put app log
OutputPort.

Note: The port will stay in an invalid state until you connect it to another processor.

21. Repeat the same steps inside the Put user log Group process group to add an
output port called Put user log OutputPort.

22. Connect the UpdateAttribute processor to Put user log OutputPort.

23. Return to the root canvas. Connect Put app log Group to Save app-user
log Group. In the configuration pop up window, simply click ADD, as there is only
one input port and one output port at the source and destination.

24. Repeat the steps above to create a connection from Put user log Group to
Save app-user log Group.


25. Start the process groups and confirm that the flows still work as they did before.
When you are done, stop the process groups.

This is the end of the exercise.


Hands-On Exercise: Using Data Provenance


In this exercise, you will explore the provenance of data in NiFi, using the
provenance search capability and examining the details and lineage of FlowFiles.
Important: This exercise depends on completion of Hands-On Exercise: Creating a Fork
Using Relationships. If you did not complete that exercise, use the Relationship
Dataflow Solution process group provided for you on the Solutions area of
your canvas. Copy the dataflow inside this process group to an empty area of the root
canvas prior to beginning this exercise.

Exploring Provenance of FlowFiles


1. Start all components of the Relation dataflow and run it for about 5 seconds, then
stop the dataflow.

2. List the files in the /tmp/nifi/original directory in time order and note the
most recent FlowFile saved there.

$ ls -lt /tmp/nifi/original

You should see files having filenames beginning with nifi-app and ending with
log. These are files written to disk by the Relation PutFile Original
processor.

3. Copy the filename for the most recent file, for example
nifi-app.16358743-18061163.log.

4. Open the global menu using the “Open menu” icon in the upper right corner of the
UI (indicated by three horizontal lines) and select Data Provenance.

A list of provenance events is displayed.


5. Peruse the list of provenance events and hover over the “View Details” icon (to the left of
each event) and the “Show Lineage” and “Go To” icons (to the right of each event). Note
how many events are in the list. In the example screenshot, 1,000 events are shown.

6. Click on the search icon (indicated by a magnifying glass) on the top right of the list
and enter the filename you copied from the /tmp/nifi/original directory.

Note: Your filename will be different.


If your work on this exercise continues from one day to the next,
you will need to enter a value for Start Date in the search dialog,
specifying the day you first ran this dataflow.

7. Click SEARCH.
Note the resulting list of events from this search. In this example, 755 events are
listed now.

The list is ordered from the most recent to the oldest events.
Note: Depending on the width of your browser, you might not be able to see the
full contents of some of the cells in the list. If you hover your pointer over the cell, a
popup will display showing the full value in the cell.

8. Click on the Date/Time column to reorder the events from oldest to newest. Note
that the first events listed are RECEIVE, FORK, DROP, and SEND. All remaining
events are CONTENT_MODIFIED.

9. Note the timestamp of the first event. In this example, it is 09:53:15.914.


10. Scroll to the end of the list and note the timestamp of that event. In this example it is
09:53:16.602.

Your times will vary.

11. Calculate the difference between these two times. In this example, the difference
is 0.688 seconds. This means that in less than one second, 755 events were
processed for this single FlowFile. This gives some perspective on how fast NiFi can
process files. Your results will vary.
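If you want the shell to do the subtraction for you, feed the seconds portions of the two timestamps to bc (shown here with the example values above; substitute your own):

$ echo "16.602 - 15.914" | bc
.688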

12. Scroll back to the top of the event list. Ensure that the list is still sorted by Date/
Time, from oldest to newest.
The first entry is a RECEIVE event. Notice that this event was produced by the
Relation TailFile component. This is the first processor in the dataflow and
the FlowFile was generated by tailing the nifi-app.log.

13. Click on the “View Details” icon to the left of the RECEIVE event. Ensure you are on
the DETAILS tab.


Note the values shown for Time, Event Duration, Type, FlowFile Uuid, File Size,
Component Name and Component Type. You might need to scroll down to see all
these details.
Note that the Source FlowFile Id field shows the bytes of the source file that the
FlowFile contains. The Transit Uri displays the source filename: /var/log/
nifi/nifi-app.log.

14. Click on the ATTRIBUTES tab. The filename attribute shows the name that the
TailFile processor assigned to the FlowFile it generated when reading from the
source file (/var/log/nifi/nifi-app.log). Note that the filename is in the
form nifi-app.xxx-yyy.log. The numbers xxx and yyy indicate the byte
range of the data read from the source file that is contained in this FlowFile.

15. Click OK to return to the provenance event list.


Observe that the second event is a FORK, which means that FlowFiles were
generated from the parent FlowFile. This event was produced by the SplitText
processor.
In the case of the Relation dataflow, the SplitText processor generates a new
FlowFile for every 15 lines contained in parent FlowFiles that it receives from the
upstream processor. These split files are then sent on to CompressContent,
UpdateAttribute, and PutFile processors.


16. Click on the “View Details” icon for this FORK event.
Note that the FlowFile Uuid shown on the DETAILS tab is the same as for the
RECEIVE event (that is, it is the same file) and that it is the parent FlowFile of many
child FlowFiles.

17. On the ATTRIBUTES tab, note that the filename is the same as for the RECEIVE
event.

18. Click OK to return to the provenance event list.


Following the FORK event are SEND and DROP events.


From the SplitText processor, the original FlowFile is sent to the Relation
PutFile Original processor, where it is written to the /tmp/nifi/original
directory and then dropped.

19. View the details for the SEND event.


Note that FlowFile Uuid is the same as for previous events and the size is the same,
indicating it is the same file.

20. Scroll down if necessary and note that the Transit Uri refers to the filename of the FlowFile
it received from the upstream processor (SplitText).
The PutFile processor uses the transit URI of each FlowFile it receives as the
destination to which the FlowFile contents are saved.

21. Click on the ATTRIBUTES tab.


Note that the filename is still the same as in the previous provenance event (FORK).

22. Return to the provenance event list.

23. View the details for the DROP event.


Note that the FlowFile Uuid value on the DETAILS tab matches those observed for
previous events.

24. Scroll down if necessary and note under Details that this event was Auto-
Terminated by success Relationship.

25. Click on the ATTRIBUTES tab and note that the filename is the same as the
FlowFiles in the previous events.

26. Return to the provenance event list.


Note the first CONTENT_MODIFIED event. The content of the FlowFile was
modified by the CompressContent processor.


View the event’s details and note that the UUID is now different from the ones in
the previous events. This FlowFile is one of the files you saw in the list of Child
FlowFiles shown in the details of the FORK event. The Parent FlowFile is the
FlowFile first generated by TailFile, as you can see from its UUID.

27. On the ATTRIBUTES tab, note that the filename and segment.original.filename
give the name of the first FlowFile—in this example, nifi-
app.16358743-18061163.log. tailfile.original.path gives the path to the
nifi-app.log read by the TailFile processor and text.line.count gives the
number of lines specified for splitting FlowFiles in the SplitText processor—15.
In the example, the first FlowFile was split into 751 child FlowFiles, as shown in the
provenance event list.

28. Return to the provenance event list.

29. Click on the “Show Lineage” icon to the right of the event line for the
CONTENT_MODIFIED event.

The red circle indicates the event for which the lineage is being examined
—CONTENT_MODIFIED in this example.

A FlowFile icon is situated between the FORK and CONTENT_MODIFIED
events. This represents the FlowFile for the currently visible lineage.


This lineage shows the events involved from compression of the first file generated
by the SplitText processor to its being given a timestamp and a new filename in
the UpdateAttribute processor, to its being written to disk and dropped.

30. Hover over the FlowFile icon.


The lines with arrows all turn red, showing the entire lineage for this particular
FlowFile.

31. Right-click on the CONTENT MODIFIED event of the lineage and click View details.

This is the same information seen for the first CONTENT_MODIFIED event by
clicking the “View Details” icon from the provenance event list. The FlowFile Uuid
specified here is the FlowFile for which this lineage is being examined.

32. Click OK to return to the lineage diagram.

33. Right-click on the ATTRIBUTES MODIFIED event of the lineage and click View
details.
Under DETAILS, note that the FlowFile Uuid is the same as for the
CONTENT_MODIFIED event seen previously.

34. Click on the ATTRIBUTES tab.


Note that the filename has a timestamp and a gz extension as a result of the
FlowFile being compressed and renamed.


Observe other attribute values visible in this window.

35. Click OK to return to the lineage diagram.

36. Right-click on the SEND event of the lineage and click View details.
Note that the FlowFile Uuid is still the same as for the first CONTENT_MODIFIED
event and that the Transit Uri shows that the compressed, timestamped file was
written to the /tmp/nifi/relation directory as specified in the Relation
PutFile processor.

37. On the ATTRIBUTES tab, note that the filename is the same as observed in the
ATTRIBUTES MODIFIED part of the lineage.

38. Return to the lineage diagram.

39. View the details of the DROP event.


Notice that under Details, this event was Auto-Terminated by success
Relationship.

40. On the ATTRIBUTES tab, observe that the filename is the same as for the SEND
lineage event (timestamped and zipped) and the UUID (seen at the bottom of the
window) is the same as the FlowFile UUID for the CONTENT_MODIFIED event.
The file is dropped after being written to disk.

41. Return to the lineage diagram.

42. Right-click on the FORK event and select Find parents.


The lineage is expanded to show the parent lineage of the CONTENT MODIFIED
FlowFile.
Another FlowFile icon is visible, representing the first FlowFile coming from a
RECEIVE event that is now also visible. Hovering over this icon shows its lineage
through being written to disk and dropped.


43. Right-click on the FORK event and click Expand.


A diagram of child FlowFile lineages is displayed. You can drag the view to the right
or left to see more of the diagram.

Note that if there are 1,024 or more child FlowFiles, clicking on
Expand will fail with a message like the following, rather than
displaying graphical lineages:

Failed to expand children for lineage of event with ID 1505156 due to:
org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024

You can verify the total number of child FlowFiles by viewing the
details of the FORK event. If you want to view expanded child
lineages, re-run the Relation dataflow for 2-3 seconds and filter your
provenance events with the most recent FlowFile from this run. Display
the lineage of the first CONTENT_MODIFIED event, and repeat the
examination of lineage details, including a new attempt to expand the
lineage of the FORK event.

44. Explore information in the lineage diagram by right-clicking on various events.

45. Right-click on the FORK event and click Collapse.


46. Click the arrow icon on the upper right of the window to return to the provenance
event list.

47. Click the arrow on the far right of the first CONTENT_MODIFIED event to view the
Relation CompressContent processor on the canvas that generated the event.

This is the end of the exercise.


Hands-On Exercise: Creating, Using, and Managing Templates
In this exercise, you will create and use a template. You will also import an
existing template file.

Creating a Template
Build a Simple Dataflow
Build a very simple flow from which you will create a template.

1. In a new area of the canvas, add a GenerateFlowFile processor. Give it the
name Template GenerateFlowFile. Set the schedule so the processor runs
every 10 seconds.

2. Add a LogAttribute processor. Give it the name Template LogAttribute.
Auto-terminate the success relationship.

3. Connect the GenerateFlowFile processor to the LogAttribute processor for
the success relationship.

Your flow should look something like this:


Create a Template
4. Select all the components of the simple flow you created above.

5. Click the “Create Template” button on the Operate palette.

6. Give the template the name Simple Flow Template, then click CREATE.

Use the Template


7. Drag the “Template” icon from the NiFi toolbar onto a clear area of the canvas.

8. When prompted, ensure that the Simple Flow Template you created above is
selected, then click ADD.

9. A second copy of the simple flow will be added to the canvas. Compare the
configurations of the new flow’s components with those of the original to confirm
that they are the same.

Export the Template


10. Open the global menu in the upper right corner of the canvas and select Templates.

11. The NiFi Templates window appears showing a list of all templates, including the
one you created in the previous section.


12. Click the “Download” button to the right of your template. This will create an XML
file containing the template.

13. Firefox will ask if you want to open or save the template XML file.
Choose Save File, then click OK. This will save the template to the file
Simple_Flow_Template.xml in your Downloads directory on your remote
desktop’s file system.

14. Open the XML file in an editor to review it. You can use whatever editor you prefer.
If you do not have a preference, you can use the Pluma editor following these steps:

a. Open Pluma using the editor icon on your remote desktop toolbar at the top of
the browser window.

b. Click Open to bring up a file browser.

c. In the Places selector on the left, click training to view the contents of your
home directory on the right.

d. Open the Downloads directory.

e. Select Simple_Flow_Template.xml and click the Open button.


Browse through the file but do not edit it. When you are done, close the editor.
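If you prefer the command line to a graphical editor, you can skim the same file with less (this assumes the template was saved under the default name shown above):

$ less ~/Downloads/Simple_Flow_Template.xml

Type q to quit when you are finished.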

When you have completed this section of the exercise, you can delete the components of
both the original and duplicate simple flows you created above.

Importing a Template
15. Click the “Upload Template” button on the Operate palette.
Make sure nothing is selected on the canvas.


16. Click the magnifying glass icon to open up a file browser.

17. In the Places selector on the left, click training to view the contents of your home
directory on the right.
Browse to the training_materials/nifi/exercises directory. Open the
putfile-dataflow-template file.
You should get a message confirming that the template was imported correctly.

18. Select the Templates option on the global menu. Confirm that the template you
just imported appears on the template list. If you are not sure which one it is, find the most
recent one based on the Date/Time column.

19. Return to the canvas. Drag the template icon onto a clear area of the canvas and
choose the newly uploaded template.
The flow defined by the imported template will appear.

When you are done with this section of the exercise, delete the flow created by the
imported template.


Back Up Your Canvas


The NiFi UI does not have an “undo” function. To avoid accidentally losing or changing
your flows, you might want to periodically back up your entire canvas so that you can
recreate it if needed. One way to do this is to export the canvas as a template.

20. Make sure you are viewing the root canvas. Select all the components on the canvas.
The easiest way to do this is using Ctrl+A.

21. Click the “Create Template” button on the Operate palette as you did earlier.

22. Give the new template a name and description that provides key information about
what the backup template contains, such as a date and what was included.

23. Review the template list to confirm that the template was saved.

This is the end of the exercise.


Hands-On Exercise: Versioning Flows Using NiFi Registry
In this exercise, you will explore options in NiFi Registry for versioning flows.

Create a NiFi Registry Bucket


1. In your remote desktop browser, go to the NiFi Registry UI using the browser
toolbar bookmark or by going to http://master.example.com:18080/nifi-registry.
The top-level page of the NiFi Registry UI shows a list of versioned flows. (The list
will be empty at this point.)

2. In the NiFi Registry UI, click the “Settings” icon (indicated by a symbol of a wrench
or spanner). This opens the Administration page which shows the list of buckets.
(The list is probably empty at this point.)

3. Click the NEW BUCKET button.

4. Enter the bucket name Exercise Flows and click CREATE.

5. The new bucket should be displayed.
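Optionally, you can confirm from a terminal that the bucket now exists by querying the Registry REST API. This is a sketch; it assumes the Registry API is exposed at its default path and returns the bucket list as JSON:

$ curl -s http://master.example.com:18080/nifi-registry-api/buckets

Look for "Exercise Flows" in the output.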


6. Return to the NiFi Registry main page by clicking the NiFi Registry navigation link
at the top of the window.

Configure a NiFi Registry Client in NiFi


7. Go to the NiFi UI in a separate tab.

8. Select Controller Settings from the global menu.

9. Select the REGISTRY CLIENTS tab. The list of clients should be empty.

10. Click the “Register a new registry client” button, indicated by a plus sign (+).

11. Enter the name Local registry and the URL of the NiFi Registry server:
http://master.example.com:18080. Then click ADD.


12. Close the NiFi Settings window to return to the canvas.

Enable Versioning for a Process Group


Create a New Process Group to Practice Versioning

13. On the root canvas, locate the flow with the label First Dataflow.

14. Select the components of the flow and copy them to the clipboard using the “Copy”
button on the Operate palette.

15. Drag the process group icon from the toolbar onto the canvas and name the group
Versioned Flow Group.

16. Double-click to open the process group, then paste the components you copied
above.

17. Use the breadcrumbs on the lower left of the canvas to return to the root canvas.

Start Versioning

18. Right-click on the process group and select Version > Start Version Control.


19. You will be prompted to name the flow and enter comments for the current version.
Use the following values:
• Registry: Local registry (you created this registry client earlier)
• Bucket: Exercise Flows (you created this bucket earlier)
• Flow Name: Simple Versioned Flow
• Flow Description (optional): a description of your choosing
• Version Comments: initial version

20. Click SAVE to commit the initial version of the flow in the process group into the
flow repository.
Note the green check mark that now appears on the process group. This indicates
that the current flow version is the most recent one in the repository.


Explore Versioning and NiFi Registry


Make Local Changes and Commit to the Repository

21. Open the versioned process group and make small changes to the flow. For
instance, try changing the names of the two processors to include the word
Versioned at the beginning.

22. Return to the tab in which you were viewing the NiFi Registry UI. You should be
viewing the top-level page showing a list of versioned flows. The list should now
include the flow you started versioning above. Notice the name of the flow (which
you set when you started versioning) and the number of current versions (just one,
so far).

23. Return to the root canvas and notice that the green check mark on the process
group has changed to a gray star. This indicates that you have made changes to the
flow locally, but not yet added the new version to the repository.


24. View the list of changes by right-clicking on the process group and selecting
Version > Show local changes.

A list of all the changes you have made since the initial version was committed
will be displayed.

25. Note that each change shows you the name of the component that was changed,
the type of change, and a comparison between the old and new versions of the
component.

26. Click the “Go To” button to the right of one of the changes (indicated by an arrow)
to view the changed component on the canvas.

27. Return to the root canvas. Right-click the process group and select Version >
Commit local changes.


28. Enter a comment that describes the changes you have made, then click SAVE.

29. Notice that the green check mark is now shown on the process group again,
indicating that the version on the canvas is the same as the latest version in the
repository.

View Version Information


30. Return to the tab in which you have the NiFi Registry UI open. Make sure you are
viewing the top-level page and refresh the page to view the list of versioned flows.
The list should show the flow you started versioning above. Click the drop-down
icon to view the flow details.

31. Review the flow information on the left and the change log on the right. Notice
that the changes are shown in reverse chronological order. Try clicking to switch
between Version 1 and Version 2 to view the version comments and commit date.

Optional: Continue Exploring


If you have extra time, try exploring versioning more. For instance, try:

• Reverting local changes before committing them
• Changing the version of the process group
• Making local changes to an older version on the canvas
Note: You might want to consider versioning flows you create and modify throughout the
remainder of the exercises. That will allow you to track your changes, and recover any
flows that are lost or accidentally modified.

This is the end of the exercise.


Hands-On Exercise: Working with FlowFile Attributes


In this exercise, you will generate FlowFiles and modify their attributes. You will
route FlowFiles based on attributes, log FlowFile attributes and view the logged
attributes.

Create a Dataflow to Generate FlowFiles, Set Attributes, Route Based on Attributes, and Log Attributes
1. Configure a GenerateFlowFile processor.

a. Add a GenerateFlowFile processor to the canvas and name it Attr
Generate Small Files.

b. Set Run Schedule to 10 sec.

c. Under PROPERTIES, set File Size to 1 b and ensure Batch Size is set to 1,
Data Format is Text and Unique FlowFiles is false.

2. Configure an UpdateAttribute processor.

a. Add an UpdateAttribute processor to the canvas and name it Attr
UpdateAttribute 1.

b. Add a property called myAttribute and give it the value Initial-File.

This simply sets an initial attribute that will be changed subsequently.

3. Connect the Generate Small Files processor to the UpdateAttribute 1
processor for a success relationship.

4. Configure a second UpdateAttribute processor.


a. Add an UpdateAttribute processor to the canvas and position it below and
to the left of the UpdateAttribute 1 processor.

b. Name the processor Attr UpdateAttribute 2.

c. Add a property called myAttribute and give it the value
${myAttribute:replace('Initial-File','Prod1-File')}.

This UpdateAttribute processor will replace the value of myAttribute
initially set (Initial-File) with the new value Prod1-File.
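For reference, the replace function substitutes every occurrence of its first argument with its second and leaves the value unchanged when there is no match. A sketch of how this expression evaluates for two hypothetical input values:

myAttribute = Initial-File  ->  ${myAttribute:replace('Initial-File','Prod1-File')}  ->  Prod1-File
myAttribute = Other-File    ->  ${myAttribute:replace('Initial-File','Prod1-File')}  ->  Other-File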

5. Configure a third UpdateAttribute processor.

a. Add an UpdateAttribute processor to the canvas and position it below and
to the right of the UpdateAttribute 1 processor.

b. Name it Attr UpdateAttribute 3.

c. Add a property called myAttribute and give it the value
${myAttribute:replace('Initial-File','Prod2-File')}.
This UpdateAttribute processor will replace the value of myAttribute
initially set (Initial-File) with the new value Prod2-File.

6. Connect the UpdateAttribute 1 processor to the UpdateAttribute 2
processor for a success relationship.

7. Connect the UpdateAttribute 1 processor to the UpdateAttribute 3
processor for a success relationship.
Your dataflow should now look something like this:


8. Add a LogAttribute processor to the canvas.

a. Position it centered below the UpdateAttribute 2 and
UpdateAttribute 3 processors.

b. Name it Attr LogAttribute 1.

c. Set the Bulletin Level to INFO under SETTINGS.


This will show attributes logged in bulletins for this processor.
Attributes will also be logged in /var/log/nifi/nifi-app.log.
Keep all other default settings. Note that the property value .* for Attributes
to Log by Regular Expression will log all attributes.

9. Connect the UpdateAttribute 2 processor to the LogAttribute 1 processor
for a success relationship.

10. Connect the UpdateAttribute 3 processor to the LogAttribute 1
processor for a success relationship.
This part of your dataflow should now look something like this:


11. Configure a RouteOnAttribute processor.

a. Add a RouteOnAttribute processor to the canvas and position it below the
LogAttribute 1 processor.

b. Name it Attr RouteOnAttribute.

c. Add a property called IsProd1 and give it the value
${myAttribute:equals("Prod1-File")}.

d. Add another property called IsProd2 and give it the value
${myAttribute:equals("Prod2-File")}.
Your PROPERTIES tab should look like this:

This allows FlowFiles to be routed based on these two attributes.

e. Check unmatched for Automatically Terminate Relationships.


12. Connect the LogAttribute 1 and RouteOnAttribute processors for a
success relationship.

13. Configure a second LogAttribute processor.

a. Add a LogAttribute processor to the canvas and position it below and to the
left of the RouteOnAttribute processor.

b. Name it Attr LogAttribute 2.

c. Set the Bulletin Level to INFO.

d. Check success for Automatically Terminate Relationships.

14. Configure a third LogAttribute processor.


a. Add a LogAttribute processor to the canvas and position it below and to the
right of the RouteOnAttribute processor.

b. Name it Attr LogAttribute 3.

c. Check success for Automatically Terminate Relationships.

15. Connect the RouteOnAttribute and LogAttribute 2 processors. Check
IsProd1 under For Relationships.

16. Connect the RouteOnAttribute and LogAttribute 3 processors. Check
IsProd2 under For Relationships.
The final part of your dataflow should look something like this:

Run the Dataflow to View Attributes That Are Set and Logged
You will now run successive portions of the dataflow just created, to observe FlowFiles
moving through it and to check attributes that are set and logged.
Note that attributes are logged to the file /var/log/nifi/nifi-app.log.
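Optionally, before starting the processors, you can watch the attribute log entries arrive in real time by leaving a tail running in a separate terminal window:

$ tail -f /var/log/nifi/nifi-app.log

Press Ctrl+C to stop following the file.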

17. Start the Generate Small Files processor. Let it run until at least two files are
queued in the subsequent connection.
Refresh the canvas as needed to see the latest number of files queued.

18. Stop the Generate Small Files processor.


Note the number of files queued in the connection.

19. Start the UpdateAttribute 1 processor.
This sets the myAttribute attribute to Initial-File on all FlowFiles processed here.
Note that both of the connections coming from this processor have the same
number of files queued. Attributes are modified and FlowFiles cloned in order to
send the same number of FlowFiles to each connection queue. When looking at the
surface panel, the UpdateAttribute 1 processor will therefore show double the
number of FlowFiles going out as coming in.

20. Once all FlowFiles have been sent to the queues coming out of the
UpdateAttribute 1 processor, stop it.

21. View the values of myAttribute in both queues coming out of the
UpdateAttribute 1 processor.

a. Right-click on one of the connections and select List queue.

b. Click the View Details icon to the left of any queue item line.

c. Note the value of myAttribute on the ATTRIBUTES tab.


d. View other queue items for both connections to verify that all FlowFiles have
myAttribute set to Initial-File.

22. Start the UpdateAttribute 2 processor, then stop it after a few seconds.

23. View the values of myAttribute in the queue items for the connection coming out
of this processor.
It should be Prod1-File.

24. Start the UpdateAttribute 3 processor, then stop it after a few seconds.

25. View the values of myAttribute in the queue items for the connection coming out
of this processor.
It should be Prod2-File.

26. Start the LogAttribute 1 processor.


On the surface, note the number of files that came in and out of this processor.
It should correspond to the total number of FlowFiles processed by the
UpdateAttribute 2 and UpdateAttribute 3 processors.


27. Hover over the bulletin icon in the upper right of the processor and note the
attributes shown.

In this example, the Prod1-File value for myAttribute is visible.

28. Stop the LogAttribute 1 processor.

29. View the values of myAttribute in the queue items for the connection coming
from the LogAttribute 1 processor.
Both Prod1-File and Prod2-File should be visible.

30. Start the RouteOnAttribute processor and then stop it after a few seconds.

31. View the values of myAttribute in the queue items for both connections coming
from this processor.

32. Start the LogAttribute 2 processor.


Note the bulletin information visible by hovering over the bulletin icon on the
processor surface.

33. View bulletin information by selecting Bulletin Board from the Global Menu.
Logged attributes should be visible here as well.


34. Stop the LogAttribute 2 processor.

35. Start the LogAttribute 3 processor.


Note that there is no bulletin icon on the processor surface. The default Bulletin Level of
WARN does not generate bulletins for the attribute entries, which are logged at the INFO
level. Bulletins were visible for the LogAttribute 1 and LogAttribute 2 processors
because their Bulletin Level was set to INFO.

36. Stop the LogAttribute 3 processor.

37. View the logged attributes in nifi-app.log.

a. Use the less command to look at nifi-app.log.

$ less /var/log/nifi/nifi-app.log

b. Once you are viewing the file, search for logged attributes by typing /
Standard FlowFile Attributes to find logged attribute entries related
to this dataflow.

/Standard FlowFile Attributes


For example:

--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
Value: 'Wed Sep 11 14:16:42 PDT 2019'
Key: 'lineageStartDate'
Value: 'Wed Sep 11 14:16:42 PDT 2019'
Key: 'fileSize'
Value: '1'
FlowFile Attribute Map Content
Key: 'filename'
Value: 'c464c054-9520-4b91-b281-2543dcf78b00'
Key: 'myAttribute'
Value: 'Prod2-File'
Key: 'path'
Value: './'
Key: 'uuid'
Value: 'c464c054-9520-4b91-b281-2543dcf78b00'
--------------------------------------------------

c. Type n to view the next log entry in the file.

d. Stop viewing the file by typing q.
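As an alternative to paging with less, you can pull out just the attribute blocks with grep. A sketch; the -A value controls how many lines are printed after each match and may need adjusting for your log:

$ grep -A 14 'Standard FlowFile Attributes' /var/log/nifi/nifi-app.log | less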

This is the end of the exercise.


Hands-On Exercise: Using the NiFi Expression Language
Files Used in This Exercise:
Data files (local): ~/training_materials/nifi/data/nel/authority-providers.xml
~/training_materials/nifi/data/nel/Exchange_Data.txt

In this exercise, you will practice using the NiFi Expression Language.
Your dataflow will

• Collect files from an input directory
• Check whether the files are regular text files or XML
• If they are regular text files, search for a particular string in the content of the files
and extract only matching records
• Save the output to a target directory

Route XML files without changes


1. Create a new process group called ExpressionLanguage for the dataflow you
will create in this exercise. This will let you separate the components you create
here from the ones that already exist on the canvas.

2. Configure a ListFile processor.


This processor collects a list of files.

a. Add a ListFile processor to the process group and name it NEL ListFile.

b. Set the Input Directory property to /home/training/training_materials/nifi/data/nel.
Keep the remaining default values.


3. Configure a FetchFile processor.


This processor receives the list of files from ListFile and brings the content of
these files into the dataflow.

a. Add a FetchFile processor to the process group and name it NEL
FetchFile.

b. Check failure, not.found, and permission.denied for Automatically
Terminate Relationships.

Keep the remaining default values.

4. Connect the ListFile and FetchFile processors for the success relationship.

5. Determine the types of the incoming files based on the file extension.

a. Add an UpdateAttribute processor, naming it NEL UpdateAttribute.

b. Add a new property under PROPERTIES called FileExtension.


Set the property to an expression to extract the file type—that is, the portion of
the filename after the dot (.): ${filename:substringAfter('.')}
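For the two sample files used later in this exercise, the expression evaluates as follows (a sketch):

authority-providers.xml  ->  FileExtension = xml
Exchange_Data.txt        ->  FileExtension = txt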

6. Connect the FetchFile and UpdateAttribute processors for the success
relationship.

7. Route the files based on the FileExtension attribute set in the upstream
processor you just added.

a. Add a RouteOnAttribute processor, naming it NEL RouteOnAttribute.

b. Add a new property called isXML.

c. Set the value of the new property to an expression that returns true if the file
extension is xml or false otherwise: ${FileExtension:equals('xml')}

8. Connect the UpdateAttribute and RouteOnAttribute processors for the
success relationship.

9. Save XML files to the local /tmp/nifi/nel directory.

a. Add a PutFile processor and name it NEL PutFile.

b. Set the Directory property to /tmp/nifi/nel and Conflict Resolution
Strategy to replace.


c. Check failure and success for Automatically Terminate Relationships.

10. Connect the RouteOnAttribute processor to the new PutFile processor for
the isXML relationship.
Your dataflow should look something like this:

Extract Specific Lines from non-XML Text Files


11. Add a SplitText processor to separate each line in the text files. Name it NEL
SplitText. Auto-terminate the failure and original relationships. Set the Line
Split Count property to 1 to generate a FlowFile for each line in the incoming files.

12. Connect the RouteOnAttribute and SplitText processors for the unmatched
relationship. This makes sure files other than XML files go to the SplitText
processor.


13. Extract records containing the date 2003-07.

a. Add a RouteOnContent processor, naming it NEL RouteOnContent.

b. Auto-terminate the unmatched relationship.

c. Set the Match Requirement property to content must contain match.

d. Add a property called StringToCompare and provide the following regular
expression to find the correct lines: (\W|^)2003-07.* (see the grep preview after this list).

e. Make a connection from the SplitText processor to the RouteOnContent
processor for the splits relationship.
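Before running the flow, you can preview which lines the StringToCompare pattern will match by testing it against the source file with grep. This is only a rough equivalent; RouteOnContent evaluates the pattern against each one-line FlowFile produced by SplitText, and the \W class is supported here by GNU grep's extended regular expressions:

$ grep -E '(\W|^)2003-07' \
  ~/training_materials/nifi/data/nel/Exchange_Data.txt | head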

14. Configure a MergeContent processor to merge the desired individual records into a
single file. Name it NEL MergeContent.

a. Check failure and original for Automatically Terminate Relationships since
we only need merged records.

b. Set the Delimiter Strategy property value to Text.

c. To put each matched record on its own line, add a single empty line under
the Demarcator property, using Shift+Enter.


15. Make a connection from the RouteOnContent processor to the MergeContent
processor for the StringToCompare relationship.

16. Make a connection from the MergeContent processor to the existing PutFile
processor for the merged relationship. This will save files in the same destination
directory.
Your dataflow should now look something like this:

17. Before starting the dataflow, verify the files to be pumped into the flow.

a. Change directories to ~/training_materials/nifi/data/nel where the
needed source files are located.

$ cd ~/training_materials/nifi/data/nel


b. List the contents of this directory.

$ ls -l

You should see the files authority-providers.xml and Exchange_Data.txt.
Note that the size of the authority-providers.xml file is about 2K and the
size of the Exchange_Data.txt file is about 11K.

c. Use the more command to view some of the records in the
Exchange_Data.txt file.

$ more Exchange_Data.txt

Press the spacebar to scroll through the file.


This file contains 2003 data. We want to extract 2003-07 data from it.

d. Stop viewing the file by typing q.

18. Start the dataflow you just created.


Note that you must clear the state of the ListFile processor each time you
run the dataflow, except for the first time, so that it will list the same files in the
specified input directory. This is because ListFile looks for new or changed files
since its last execution and will not process files it has previously found unless its
state is cleared.
If necessary, right-click on the ListFile processor and select View state. On the
Component State pop-up window, click on Clear state.

Then start the dataflow.


19. View the files in the PutFile destination directory /tmp/nifi/nel.

$ ls -l /tmp/nifi/nel

Note that the size of the XML file is still about 2K, but the size of the txt file is
significantly smaller due to extracting only records containing the string 2003-07.

20. Use the more command to view the contents of the txt file.

$ more /tmp/nifi/nel/Exchange_Data.txt

Note that only records containing the string 2003-07 are shown. The content of
the file has been filtered using NiFi expression language.

21. Stop the dataflow.

Simplify the Dataflow by Reducing the Number of Processors


22. If you want to preserve the current dataflow as it is before modifying it, copy the
existing ExpressionLanguage group.
Rename the copied group ExpressionLanguage Simplified or something
similar.
Work with the Simplified dataflow in the remaining steps of this exercise.

23. Modify the configuration of the RouteOnAttribute processor by changing
the isXML property to add another function. Replace FileExtension with
filename:substringAfter('.') in the original expression, resulting in the
following:
${filename:substringAfter('.'):equals('xml')}
This will check the filename extension for the string xml and return true if this
condition is met, all in one processor.

24. Remove the connections around the UpdateAttribute processor and then
remove the processor, as it is no longer needed.

25. Connect the FetchFile processor directly to the RouteOnAttribute processor
for a success relationship.


Your dataflow should now look something like this:

This simplified dataflow provides identical functionality to the previous dataflow
with one fewer processor.

26. Check the timestamps of the files already written to /tmp/nifi/nel.

$ ls -l /tmp/nifi/nel

27. Start the dataflow and check the files written to /tmp/nifi/nel to verify correct
functionality.
Clear the state of the ListFile processor first, if necessary.
Note that the timestamps on the files should have changed.

28. Stop the dataflow.

This is the end of the exercise.


Hands-On Exercise: Building an Optimized Dataflow


In this exercise, you will create a dataflow that controls the rate at which data
moves through the flow, names files based on file size, and writes them to disk.

Build and Run a Dataflow with a Controlled Rate of Ingestion


Configure a Dataflow Using a ControlRate Processor

1. Configure three GenerateFlowFile processors.

a. Add a GenerateFlowFile processor to the canvas, naming it Opt 1KB
GenerateFlowFile.

b. Set Run Schedule to 10 sec.

c. Set the File Size property to 1KB and keep the remaining defaults.

d. Add a second GenerateFlowFile processor to the canvas, naming it Opt 10KB
GenerateFlowFile.

e. Set Run Schedule to 30 sec.

f. Set the File Size property to 10KB and keep the remaining defaults.

g. Add a third GenerateFlowFile processor to the canvas, naming it Opt 100KB
GenerateFlowFile.

h. Set Run Schedule to 60 sec.

i. Set the File Size property to 100KB and keep the remaining defaults.

2. Add a Funnel to the canvas.

3. Drag connections from each of the three GenerateFlowFile processors to the
Funnel for the success relationship.
Your dataflow should look something like this at this point:


4. Configure a ControlRate processor to manage the data flow rate.

a. Add a ControlRate processor to the canvas, naming it Opt ControlRate.

b. Check failure for Automatically Terminate Relationships.

c. Ensure the Rate Control Criteria property is set to data rate, and the Time
Duration property is set to 1 min.

d. Set the Maximum Rate property to 120KB.

5. Drag a connection from the Funnel to the ControlRate processor.

6. Configure an UpdateAttribute processor with rules for naming files based on
file size.

a. Add an UpdateAttribute processor to the canvas, naming it Opt Rules
UpdateAttribute.

b. At the bottom of the SETTINGS tab, click ADVANCED.


The ADVANCED tab is available from any of the other tabs in the
processor configuration.

c. On the subsequent screen under FlowFile Policy, click the drop-down arrow
and select use original.

d. Click the + to the right of Rules to add a rule.

e. Name the rule Size1KB and click ADD.

f. Click the + to the right of Conditions and add the following Expression:
${fileSize:le(1024)}

g. Click the + to the right of Actions.

h. Enter filename in the box for Attribute.

i. Enter the following for the Value:


filesize-log-${nextInt()}.${now():format('HH:mm:ss')}-
small
This will name files of 1KB size with a prefix of filesize-log, followed by a
unique integer value, a timestamp, and ending with the string small.

j. Click ADD.


k. Click SAVE at the lower right-hand corner to save this rule.


The Rule settings should look like this:

Note that the name of the rule appears on the upper right of the screen when a
rule is selected on the left side of the screen.

l. Add another rule and name it Size10KB.

m. In the Add Rule screen in the box Copy From Existing Rule (Optional), type
Size.
This will show the Size1KB rule previously configured.

n. Select the existing Size1KB rule to use as a template for this new rule.

o. Click ADD.


p. Click in the box containing the expression for the Size10KB rule and change it
to:
${fileSize:gt(1024)}

q. Click OK.

r. Add another condition of ${fileSize:le(10240)}.


This will match file sizes greater than 1KB and less than or equal to 10KB.

s. Click in the Value entry for the filename attribute and change the string
small to be medium.

t. Click SAVE.
Your rules should now look like this:

u. Add a third rule and name it Size100KB.

v. In the Add Rule screen in the box Copy From Existing Rule (Optional), type
Size.
This will show the Size1KB and Size10KB rules previously configured.

w. Select the existing Size10KB rule to use as a template for this rule and click
ADD.

x. Change the expressions to be as follows:


${fileSize:gt(10240)}
${fileSize:le(102400)}
This will match file sizes greater than 10KB and less than or equal to 100KB.

y. Click in the Value entry for the filename attribute and change the string
medium to be large.


z. Click SAVE.
Your rules should now look like this:

aa. Click the X in the upper right corner of the Rules screen.

7. Drag a connection from the ControlRate processor to the UpdateAttribute
processor for the success relationship.

8. Configure a CompressContent processor.

a. Add a CompressContent processor to the canvas and name it Opt
CompressContent.

b. Auto-terminate the failure relationship.

c. Set the Compression Format property to gzip and Update Filename to true.
This will compress the file using the gzip format and add a gz extension to the
filename automatically.

9. Drag a connection from the UpdateAttribute processor to the
CompressContent processor for the success relationship.

10. Configure a PutFile processor.


a. Add a PutFile processor to the canvas and name it Opt PutFile.

b. Check failure and success for Automatically Terminate Relationships.

c. Set the Directory property to /tmp/nifi/opt, the Conflict Resolution


Strategy to replace, and ensure the Create Missing Directories property is
set to true.

11. Drag a connection from the CompressContent processor to the PutFile
processor for the success relationship.
Your dataflow should now look something like this:


Run the Dataflow and Observe the Controlled Flow of Data

12. Select all the processors of the dataflow and start them.

13. Note the number of files and amount of data shown on the surface of the
ControlRate processor.
Refresh the canvas frequently in order to see current statistics as you run this
dataflow.

14. Observe that when the amount of data flowing to the ControlRate processor
exceeds 120KB, data starts to be queued in the connection just preceding it.


Recall that the rate of 1KB file generation is every 10 seconds, that of the 10KB files
is every 30 seconds, and that of the 100KB files is every 60 seconds. One file of each
size is generated immediately upon starting the dataflow, then the timer generates
files per this schedule. Therefore, 111KB of data is sent through the dataflow right
away (one 100KB, one 10KB, and one 1KB file).
Over the next 30 seconds, two (or three, depending on timing) more 1KB files are
sent (10 seconds each) and one 10KB file is sent (at the 30 second mark). This totals
six (or seven) files with 123KB (or 124KB) of data, and at that point, additional files
should be queued.
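One possible timeline, assuming each timer fires exactly on schedule (your timing will differ slightly):

t = 0s     1KB + 10KB + 100KB arrive    total = 111KB
t = 10s    + 1KB                        total = 112KB
t = 20s    + 1KB                        total = 113KB
t = 30s    + 1KB + 10KB                 total = 124KB  (exceeds the 120KB-per-minute limit, so queueing begins)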

15. Note when queued files begin to be released to continue through the dataflow.

16. Observe the number of files and amount of data shown on the surface statistics of
the remaining processors in the dataflow. Compare this to the number of files and
amount of data generated by the three GenerateFlowFile processors.

17. List the files saved in the /tmp/nifi/opt directory.

$ ls -l /tmp/nifi/opt

The filenames reflect file sizes and have timestamps, per the configuration in the
UpdateAttribute processor.
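Based on the rules configured earlier, a typical name for one of the small files would look something like the following (purely illustrative; the integer and timestamp are generated at run time, and the gz extension is added by the CompressContent processor):

filesize-log-3.09:15:22-small.gz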


Note the number of files shown and compare this to the file count that has passed
through the dataflow.
Compare the file count for each file size with the numbers generated for the
dataflow.

18. To verify the timing of files sent through the dataflow, sort the file listing by
modification time by adding the t option:

$ ls -lt /tmp/nifi/opt

19. After a few minutes, stop the dataflow.

This is the end of the exercise.


Hands-On Exercise: Building Site-to-Site Dataflows


Files Used in This Exercise:
Log files (local): /var/log/nifi/nifi-app.log
/var/log/nifi/nifi-user.log
Output directory (remote): /tmp/nifi/remote-data

In this exercise, you will use a remote process group to implement a Site-to-Site
dataflow.
You will use the existing flows to collect the output in nifi-app.log and nifi-
user.log, and send the data to a remote NiFi instance.
Important: This exercise depends on completion of Hands-On Exercise: Simplify
Dataflows Using Process Groups. If you did not complete that exercise, use the
Process Groups with I/O Ports Solution process group provided for you
on the Solutions area of your canvas. Copy the process groups and ports inside this
process group to an empty area of the root canvas prior to beginning this exercise.

Starting and Checking a Remote NiFi Instance


1. Using the URL provided to you at the beginning of the course, navigate to the
exercise environment portal.
Two VMs should be visible: a Master and a Peer. The Master VM should be running
(indicated by a green background) and the Peer should be powered off (indicated
by a gray background).


Note: These exercises refer to the Master VM (hostname master.example.com)
as the local host and to the Peer VM (hostname peer.example.com) as the
remote host.

2. Start the peer VM by clicking the play button (triangle icon).


If the peer is already in a running state, proceed to the next step.

3. Click the peer host thumbnail to open a new window showing the remote host
desktop. If the VM was not already running, wait 5-10 minutes for the VM and its
services to start.

4. Open a terminal window using the remote host desktop shortcut.

5. In the terminal window, run the service status verification script:

$ check-health.sh

6. Confirm that all required services are noted as “good.” Required services: HDFS,
Hive, Kafka, NiFi, NiFi Registry, Zookeeper.


7. If any of the required services are “bad”, wait several minutes and then try again. (It
may take up to 10 minutes for all services to start fully.) If they are still not running
correctly, try restarting all the services by running the following script:

$ start-cluster.sh

Wait until the script completes and then check the health again.

8. Start the Firefox browser and click on the NiFi bookmark. The URL is http://
peer.example.com:8080/nifi.
Verify that you can see the NiFi canvas.

Building a Dataflow on the Remote NiFi Instance to Save Incoming Files
9. In your browser, ensure you are viewing the peer (remote) NiFi instance.

10. Configure a PutFile processor on the peer system to write data received from the
master (local) system.

a. Add a PutFile processor to the canvas and name it Remote PutFile.

b. Auto-terminate both the success and failure relationships.

c. Set the Directory property to /tmp/nifi/remote-data.

d. Set the Conflict Resolution Strategy to replace and ensure that Create
Missing Directories is set to true.

11. Drag the input port icon to the canvas and name it Remote-InputPort-1.

12. Connect the new input port to the PutFile processor.


Restructuring a Dataflow on the Local NiFi Instance to Send Files to the Remote NiFi Instance
13. In your browser, go to the NiFi UI running on your local (master) NiFi instance.
Make sure you can see the dataflows that you created in previous exercises.

14. Locate the dataflow you created for the exercise entitled Hands-On Exercise:
Simplify Dataflows Using Process Groups. This dataflow includes three process
groups: Put app log Group, Put user log Group, and Save app-user
log Group.

15. Select all components of the dataflow, then copy and paste the dataflow to a new
area of the canvas.
The copied process group names will now have a prefix of Copy of followed by
the remainder of the process group name.

16. Rename the Put app log and Put user log process groups to be Remote
Put app log Group and Remote Put user log Group, respectively. Leave
the Save app-user log Group name as is, because it will be deleted.

17. Delete the connections between the Save app-user log Group and the other
two process groups. You should now have three unconnected process groups.

18. Delete the Save app-user log process group.


19. Drag the remote process group icon onto the canvas. When prompted for a URL,
enter the URL of the remote NiFi instance: http://peer.example.com:8080/nifi/.

Your dataflow should now look like this:

20. Connect the Put app log process group to the remote process group. The Create
Connection dialog window should show Put app log OutputPort for From Output
and Remote-InputPort-1 for To Input.


21. Similarly, connect the Put user log process group to the remote process group.
The Create Connection dialog window should show Put user log OutputPort for
From Output and Remote-InputPort-1 for To Input.

Your dataflow on the master should now look something like this:

Starting the Site-to-Site Flow


22. In the peer (remote) NiFi UI, start all the components of the flow.


23. In the master (local) NiFi UI, right-click on the remote process group and select
Enable Transmission to activate the remote flow:

24. Start the Put app log and Put user log process groups.

25. In the master NiFi UI, observe the statistics displayed on the remote process group.


26. In a terminal window on the peer, verify that files were saved in the specified
destination directory.

$ ls -l /tmp/nifi/remote-data
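
If you would like a quick count of how many files arrived, you can also pipe the
listing through wc. This is an optional convenience, not required by the exercise:

$ ls /tmp/nifi/remote-data | wc -l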

27. Stop the dataflows running on both the remote and local NiFi systems.

28. In the master NiFi UI, disable transmission for the remote process group
component.

29. Move all the components you created in the master NiFi instance to a new process
group called Site-to-Site Dataflow.

30. Go to the URL for your exercise environment portal and shut down the peer host
using the "Power options for this VM" icon. Select the Shut down option.

31. Click the X on the peer browser tab to close it.

This is the end of the exercise.


Hands-On Exercise: Monitoring and Reporting


In this exercise, you will explore the monitoring and reporting features of NiFi.

Configure Disk Usage Reporting


1. In a terminal window, use the df command to view the amount of space used by the
local Linux filesystem’s root directory.

$ df -h /

Take note of the value in the Use% column, indicating the percentage of the current
disk space used.
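
If you prefer to isolate just that value, a small pipeline such as the following
prints only the Use% column (field 5 of the data row). This is optional and assumes
the df output fits on a single header line plus one data line:

$ df -h / | awk 'NR==2 {print $5}'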

2. In the NiFi UI, select Controller Settings from the global menu.

3. Select the REPORTING TASKS tab.

4. Click the “Create a new reporting task” button (indicated by a plus sign).

5. Select the MonitorDiskUsage task and click ADD.

6. Notice that the warning icon displays next to the new reporting task, indicating that
it is misconfigured. Hover your pointer over the warning icon on the left side of the
entry to see the problem. In this case, it is indicating that you must set a value for
Directory Location.
Click the “Edit” button (indicated by a pencil icon) and configure the task as follows:
• Run schedule: 1 min
• Properties
◦ Threshold: 30% (This is not a typical threshold you would set in a production
environment, but to be able to test this task, the value must be lower than the
used disk space you noted earlier.)
◦ Directory Location: / (the local filesystem root directory)
◦ Directory Display Name: root (this is how the statistics will be labeled in the
report)

7. Save the configuration by clicking APPLY.


8. Start the MonitorDiskUsage reporting task by using the “Start” button to the
right of the task in the list.

9. Let the task run for a few minutes, then refresh the list of tasks using the refresh
icon in the bottom left.
You should see a red outline next to the task indicating that reporting events have
occurred—that is, the per-minute check of the disk usage was over the configured
threshold of 30%.
Hover your pointer over the indicator to see details.

Note that the directory display name you configured above (root) is used to
identify the disk that is being monitored.

10. Stop the reporting task using the “Stop” icon to the right of the task list, then close
the NiFi Settings window.

11. Select Bulletin Board from the global menu.

12. The list of bulletins should include the same warnings you saw in the notification
pop-up on the NiFi Settings page above.

13. Click the X in the upper right of the screen to close the Bulletin Board.

Explore Status History


14. Locate a processor in a flow you have recently run. Right-click it to open the context
menu and select View Status History.
Note: NiFi only keeps historical statistics for 24 hours. If it has been longer than
24 hours since the flow has run or processed a FlowFile, you will need to run the
flow again to generate some statistics to view. After running the flow, wait several
minutes for the statistics to become available.

15. You should see a graph showing statistics for the selected metric (FlowFiles Out by
default).
Try dragging a rectangle around a small portion of the lower graph. The larger
graph above will zoom in on the selected area.

16. Try selecting different types of metrics. You can view and select from the available
metrics by opening the selection box at the top of the upper graph.

17. When you are done exploring, close the Status History window.


This is the end of the exercise.


Hands-On Exercise: Adding Apache Hive Controller


In this exercise, you will configure a controller for use with Hive.
This exercise simply gives you experience configuring controllers. This controller is not
used subsequently.

Configuring a Hive Connection Pool Controller


1. Click on the canvas to make sure that no individual components are currently
selected.

2. Click on the "Configuration" icon on the Operate palette.

3. Select the CONTROLLER SERVICES tab.

4. Click the + on the right side of the screen to add a controller.

5. Filter on the string hive, select the HiveConnectionPool controller, and click
ADD.

6. Configure the controller properties as follows:


• Database Connection URL - jdbc:hive2://master.example.com:10004/default

• Hive Configuration Resources - /etc/hive/conf/hive-site.xml

• Database User - hive

7. Click APPLY.

8. Enable the controller with a Scope of Service only.


The controller should now be enabled.

If a controller stays in an Enabling state for a long period of
time, you can click the X at the upper-right of the NiFi Flow
Configuration screen and then bring the screen up again from the
Operate palette to see if it is now Enabled.

9. Click on the X at the upper-right of the screen to exit NiFi Flow Configuration.
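
If you want to verify the connection details outside of NiFi, you can optionally try
the same JDBC URL from a terminal window. This assumes the beeline client is
available on this host; it is not required for the exercise:

$ beeline -u jdbc:hive2://master.example.com:10004/default \
-n hive -e "SHOW TABLES;"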

This is the end of the exercise.


Hands-On Exercise: Integrating Dataflows with Kafka and HDFS
Files Used in This Exercise:
Dataflow data file: ~/training_materials/nifi/data/integrate/mydata.json
PutHDFS configuration files: ~/training_materials/nifi/data/hdfs/hdfs-site.xml
and ~/training_materials/nifi/data/hdfs/core-site.xml

In this exercise, you will explore NiFi integration with Kafka and HDFS.

Ingesting JSON Data to Kafka from NiFi


Publishing Data to a Kafka Broker from NiFi

1. On the NiFi UI, add a ListFile processor to the canvas with an Input Directory
of /home/training/training_materials/nifi/data/integrate. Name
it Integrate ListFile.

2. Add a FetchFile processor to the canvas and check failure, not.found, and
permission.denied for Automatically Terminate Relationships. Name it
Integrate FetchFile.

3. Drag a connection from the ListFile processor to the FetchFile processor for
the success relationship.


4. The FetchFile processor obtains a JSON file containing sample data that will
be sent to Kafka. You may examine the file in a terminal window if desired:

$ cat ~/training_materials/nifi/data/integrate/mydata.json

[
{
"id":1,
"name":"Jobin",
"title":"consultant"
},
{
"id":2,
"name":"Sam",
"title":"developer"
},
{
"id":3,
"name":"Mary",
"title":"software engineer"
}
]
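
If you want to confirm that the file parses as JSON, you can optionally run it
through Python's built-in pretty-printer. This assumes Python is installed on the
host and that the file is a complete JSON array as shown above:

$ python -m json.tool \
~/training_materials/nifi/data/integrate/mydata.json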

5. In a terminal window, create a Kafka topic named nifidata and then list Kafka
topics:

$ kafka-topics \
--create --zookeeper master.example.com:2181/kafka \
--replication-factor 1 --partitions 1 \
--topic nifidata
$ kafka-topics --list --zookeeper master.example.com:2181/kafka

Toward the end of the output from the command to list Kafka topics,
you should see the string nifidata after a line that contains INFO
zookeeper.ZooKeeperClient: [ZooKeeperClient] Connected.
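
Optionally, you can also describe the new topic to confirm its partition and
replication settings, using the same style of command as above:

$ kafka-topics --describe \
--zookeeper master.example.com:2181/kafka \
--topic nifidata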

6. From the NiFi UI, configure a PublishKafkaRecord_2_0 processor.


a. Add a PublishKafkaRecord_2_0 processor to the canvas and name it
Integrate PublishKafkaRecord_2_0.

b. Auto-terminate failure and success.

c. Set the following property values:


• Kafka Brokers - master.example.com:9092
• Topic Name - nifidata (the Kafka topic just created)
• Use Transactions - false
Note: The port of 9092 given for the Kafka Brokers property is for a CM/
CDP environment. In an HDP/HDF environment, the port is 6667.

d. Click in the Value field for the Record Reader property to configure a Record
Reader controller service.

e. Click the down-arrow and select Create new service...

f. On the Add Controller Service screen, click the down-arrow under
Compatible Controller Services and select JsonTreeReader.


After this selection, your screen should look like this:

g. Click CREATE.


The JsonTreeReader should now be listed as the value for Record Reader.

h. Click on the arrow at the right of the Record Reader property. Click YES when
asked to Save changes before going to this Controller Service?.

i. You should now be viewing CONTROLLER SERVICES in the NiFi Flow
Configuration screen. The HiveConnectionPool controller should be visible
and enabled, as previously configured. The JsonTreeReader controller is not
yet configured or enabled.

j. View the configuration of the JsonTreeReader controller.
The SETTINGS tab shows that this controller is referenced by the
PublishKafkaRecord_2_0 processor.

On the PROPERTIES tab, note that, by default, the schema of the data to be used
by the PublishKafkaRecord_2_0 processor is inferred. This means that the
controller will automatically create a schema by examining the data it finds in
the FlowFile.

k. Click CANCEL to keep the default settings for this controller.

l. Enable the JsonTreeReader controller with a Scope of Service only.


m. Click the X in the upper-right of the NiFi Flow Configuration screen to return
to the canvas.

n. Return to the PROPERTIES tab of the PublishKafkaRecord_2_0
processor.

o. Click in the Value field for the Record Writer property to configure a Record
Writer controller service.

p. Click the down-arrow and select Create new service....

q. Under Compatible Controller Services, select JsonRecordSetWriter and click
CREATE.

The processor properties should now look like this:


r. Click on the arrow to the right of the Record Writer property to configure the
JsonRecordSetWriter controller service.
Click YES to Save changes before going to this Controller Service?.

s. View the configuration of the JsonRecordSetWriter controller service.


Note that the PublishKafkaRecord_2_0 processor is a Referencing
Component.
Review values for schema properties.

t. Click CANCEL to keep all default settings.

u. Enable the controller service with a Scope of Service only.

v. Click the X in the upper-right of the NiFi Flow Configuration screen to return
to the canvas.

7. Drag a connection from the FetchFile processor to the
PublishKafkaRecord_2_0 processor for the success relationship.
Your dataflow should look something like this:


8. In a terminal window, run the following command to view messages pushed to
Kafka:

$ kafka-console-consumer \
--bootstrap-server master.example.com:9092 \
--topic nifidata

The prompt will not return as this command waits to display messages sent to
Kafka.

9. Clear the state of the ListFile processor.

10. Start the ListFile, FetchFile, and PublishKafkaRecord_2_0 processors.


Note the statistics on the processor surfaces.
View the messages for the nifidata topic returned in the terminal window where
the kafka-console-consumer command is running.
You should see output like this:

{"id":1,"name":"Jobin","title":"consultant"}
{"id":2,"name":"Sam","title":"developer"}
{"id":3,"name":"Mary","title":"software engineer"}

11. Stop the processors. Leave the terminal window open where the kafka-
console-consumer command is running.


Consuming Data from Kafka and Storing it in HDFS

12. Configure a ConsumeKafka_2_0 processor.

a. Name the processor Integrate ConsumeKafka_2_0.

b. Set the following property values:


• Kafka Brokers - master.example.com:9092
• Topic name - nifidata
• Group ID - 0001

Note that Offset Reset is set to latest. This picks up the most
recent data from the Kafka topic. If it is set to earliest, it will
pick up all data from the specified topic from the beginning.
This can result in a very large amount of data processed by the
ConsumeKafka_2_0 processor.
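
Outside of NiFi, the kafka-console-consumer command behaves in a similar way:
without --from-beginning it shows only new messages (like latest), and with
--from-beginning it replays the topic from the start (like earliest). For example,
the following optional command would replay everything in the topic:

$ kafka-console-consumer \
--bootstrap-server master.example.com:9092 \
--topic nifidata \
--from-beginning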

13. Configure a PutHDFS processor.

a. Name the processor Integrate PutHDFS.

b. Auto-terminate the failure and success relationships.

c. Set the following property values:


• Hadoop Configuration Resources -
/home/training/training_materials/nifi/data/hdfs/hdfs-site.xml,
/home/training/training_materials/nifi/data/hdfs/core-site.xml
• Directory - /tmp/nifidata
• Conflict Resolution Strategy - replace
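
Optionally, you can confirm that HDFS is reachable and pre-create the target
directory from a terminal window. This is not required; PutHDFS will create the
directory on first write:

$ hdfs dfs -mkdir -p /tmp/nifidata
$ hdfs dfs -ls /tmp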

14. Make a connection from the ConsumeKafka_2_0 processor to the PutHDFS
processor for the success relationship.
Your canvas should now look something like this:


When running this dataflow, the ListFile processor state
must be cleared every time you need to send the file through
the dataflow. The ConsumeKafka_2_0 processor is started
before the PublishKafkaRecord_2_0 processor so that the
ConsumeKafka_2_0 processor is ready to consume data from
the Kafka topic when data is sent through the flow.

15. Clear the ListFile processor state.

16. Start the ConsumeKafka_2_0 and PutHDFS processors.

17. Start the ListFile, FetchFile, and PublishKafkaRecord_2_0 processors.


Note that data has now passed out of the ConsumeKafka_2_0 processor to the
PutHDFS processor.

18. In the terminal window running the kafka-console-consumer command, you
should see more messages returned.


19. In a different terminal window, run the following command to view the data written
to HDFS.

$ hdfs dfs -ls /tmp/nifidata

Files should be returned, corresponding to the rows of the mydata.json file
published to Kafka under the nifidata topic. Note the alphanumeric names of the
files, for example, cf9b5835-5a8f-492e-8d44-76d64dbe351a.

20. Choose some files and view their contents to verify the rows written. Run the
following command for each file, substituting one of the alphanumeric filenames
returned by the previous command for filename:

$ hdfs dfs -cat /tmp/nifidata/filename

Each file should contain one of the lines returned in the terminal window running
the kafka-console-consumer command, for example:

{"id":1,"name":"Jobin","title":"consultant"}

If the prompt comes back on the same line as the output, just press
the Enter key to get the prompt on a line by itself.
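
If you would rather view all of the files at once instead of one at a time, you can
optionally let HDFS expand a glob pattern. Quote the pattern so the local shell does
not try to expand it first:

$ hdfs dfs -cat '/tmp/nifidata/*'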

21. Stop the processors.

22. Press Ctrl+C in the window running the kafka-console-consumer
command to terminate it.

You have now set up two different ways to interact with CDP services:

• Read JSON data and write to Kafka.


• Consume Kafka data and write to HDFS.

This is the end of the exercise.
