
Cloudera DataFlow: Flow Management with Apache NiFi

210415
Introduction
Chapter 1
Course Chapters

▪ Introduction
▪ Introduction to Cloudera Flow Management
▪ Processors
▪ Connections
▪ Dataflows
▪ Process Groups
▪ FlowFile Provenance
▪ Dataflow Templates
▪ Apache NiFi Registry
▪ FlowFile Attributes
▪ NiFi Expression Language
▪ NiFi Architecture
▪ Dataflow Optimization
▪ Site-to-Site Dataflows
▪ Cloudera Edge Management and MiNiFi
▪ Monitoring and Reporting
▪ Controller Services
▪ Integrating NiFi with the Cloudera Ecosystem
▪ NiFi Security
▪ Conclusion

Copyright © 2010–2021 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-2
Trademark Information
▪ The names and logos of Apache products mentioned in Cloudera training
courses, including those listed below, are trademarks of the Apache Software
Foundation
Apache Accumulo Apache Hive Apache Pig
Apache Avro Apache Impala Apache Ranger
Apache Ambari Apache Kafka Apache Sentry
Apache Atlas Apache Knox Apache Solr
Apache Bigtop Apache Kudu Apache Spark
Apache Crunch Apache Lucene Apache Sqoop
Apache Druid Apache Mahout Apache Storm
Apache Flink Apache NiFi Apache Tez
Apache Flume Apache Oozie Apache Tika
Apache Hadoop Apache ORC Apache Zeppelin
Apache HBase Apache Parquet Apache ZooKeeper
Apache HCatalog Apache Phoenix

▪ All other product names, logos, and brands cited herein are the property of
their respective owners

Chapter Topics

Introduction
▪ About This Course
▪ Introductions
▪ About Cloudera
▪ About Cloudera Educational Services
▪ Course Logistics
▪ Hands-On Exercise: Using Your Exercise Environment

Course Objectives (1)
During this course, you will learn
▪ About Cloudera Flow Management in the context of the Cloudera Dataflow
Data-in-Motion Platform
▪ How NiFi and MiNiFi fit into the Cloudera Edge to AI paradigm
▪ About the NiFi Architecture, including standalone and clustered configurations
▪ About the key features, concepts, and benefits of NiFi
▪ How FlowFiles, processors, process groups, controllers, and connections work
together to define a NiFi dataflow

Course Objectives (2)
▪ To navigate, configure dataflows, and use dataflow information with the NiFi
User Interface
▪ To trace the life of data, its origin, transformation and destination, using data
provenance
▪ To organize and simplify dataflows
▪ How to manage dataflow versions using the NiFi Registry
▪ How to use the NiFi Expression Language to control dataflows
▪ About dataflow optimization methods and available monitoring and reporting
features

Introductions
▪ About your instructor
▪ About you
─ Currently, what do you do at your workplace?
─ What is your experience with database technologies, programming, and
query languages?
─ How much experience do you have with UNIX or Linux?
─ What is your experience with big data?
─ What do you expect to gain from this course? What would you like to be
able to do at the end that you cannot do now?

About Cloudera
 

THE ENTERPRISE DATA CLOUD COMPANY

 
▪ Cloudera (founded 2008) and Hortonworks (founded 2011) merged in 2019
▪ The new Cloudera improves on the best of both companies
─ Introduced the world’s first Enterprise Data Cloud
─ Delivers a comprehensive platform for any data from the Edge to AI
─ Leads in training, certification, support, and consulting for data professionals
─ Remains committed to open source and open standards

Cloudera Data Platform

A suite of products to collect, curate, report, serve, and predict

▪ Cloud native or bare metal deployment
▪ Powered by open source
▪ Analytics from the Edge to AI
▪ Unified data control plane
▪ Shared Data Experience (SDX)

Cloudera Shared Data Experience (SDX)

▪ Full data lifecycle: Manages your data from ingestion to actionable insights
▪ Unified security: Protects sensitive data with consistent controls
▪ Consistent governance: Enables safe self-service access

Self-Serve Experiences for Cloud Form Factors
▪ Services customized for specific steps in the data lifecycle
─ Emphasize productivity and ease of use
─ Auto-scale compute resources to match changing demands
─ Isolate compute resources to maintain workload performance
 

Cloudera DataFlow
▪ Data-in-motion platform
▪ Reduces data integration development time
▪ Manages and secures your data from edge to enterprise

Cloudera Machine Learning

▪ Cloud-native enterprise machine learning


─ Fast, easy, and secure self-service data science in enterprise environments
─ Direct access to a secure cluster running Spark and other tools
─ Isolated environments for running Python, R, and Scala code
─ Teams, version control, collaboration, and project sharing

Cloudera Data Hub

Customize your own experience in cloud form factors


▪ Integrated suite of analytic engines
▪ Cloudera SDX applies consistent security and governance
▪ Fueled by open source innovation

Cloudera Educational Services
▪ We offer a variety of ways to take our courses
─ Instructor-led, both in physical and virtual classrooms
─ Private and customized courses also available
─ Self-paced, through Cloudera OnDemand
▪ Courses for all kinds of data professionals
─ Executives and managers
─ Data scientists and machine learning specialists
─ Data analysts
─ Developers and data engineers
─ System administrators
─ Security professionals

Cloudera Education Catalog
▪ A broad portfolio across multiple platforms
─ Not all courses shown here
─ See our website for the complete catalog
 
▪ Administrator
─ Administrator (CDH | HDP | CDP)
─ Security (CDH | HDP)
─ NiFi (CDF)
─ AWS Fundamentals for CDP
▪ Data Analyst
─ Data Analyst (CDH | CDP)
─ Hive 3 (HDP)
─ Kudu (CDH)
─ Cloudera Data Warehouse (CDP)
▪ Developer & Data Engineer
─ Spark (CDH | HDP)
─ Spark Performance Tuning (CDH)
─ Stream Developer (CDF)
─ Kafka (CDP)
─ Search | Solr (CDH)
─ Architecture Workshop (CDH)
▪ Data Scientist
─ Data Scientist (CDH | HDP | CDP)
─ Cloudera DS Workbench (CDH | HDP)
─ CML (CDP)
▪ Delivery formats: Private Class, Public Class, OnDemand

Cloudera OnDemand
▪ Our OnDemand catalog includes
─ Courses for developers, data analysts, administrators, and data scientists,
updated regularly
─ Exclusive OnDemand-only courses, such as those covering security and
Cloudera Data Science Workbench
─ Free courses such as Essentials and Cloudera Director available to all with or
without an OnDemand account
▪ Features include
─ Video lectures and demonstrations with searchable transcripts
─ Hands-on exercises through a browser-based virtual environment
─ Discussion forums monitored by Cloudera course instructors
─ Searchable content within and across courses
▪ Purchase access to a library of courses or individual courses
▪ See the Cloudera OnDemand information page for more details or to make a
purchase, or go directly to the OnDemand Course Catalog

Accessing Cloudera OnDemand

▪ Cloudera OnDemand subscribers can access their courses online through a web
browser
▪ Cloudera OnDemand is also available through an iOS app
─ Search for “Cloudera OnDemand” in the iOS App Store

Cloudera Certification
▪ The leader in Apache Hadoop-based certification
▪ Cloudera certification exams favor hands-on, performance-based problems
that require execution of a set of real-world tasks against a live, working
cluster
▪ We offer two levels of certifications
─ Cloudera Certified Associate (CCA)
─ CCA Spark and Hadoop Developer
─ CCA Data Analyst
─ CCA CDH Administrator and CCA HDP Administrator
─ Cloudera Certified Professional (CCP)
─ CCP Data Engineer

Logistics
▪ Class start and finish time
▪ Lunch
▪ Breaks
▪ Restrooms
▪ Wi-Fi access
▪ Virtual machines

Downloading the Course Materials
1. Log in using https://university.cloudera.com/user
▪ If necessary, use the Register Now link on the right to create an account
▪ If you have forgotten your password, use the Reset Password link

2. Scroll down to find this course


▪ If necessary, click My Learning under the photo
▪ You may also want to use the Current filter
3. Select the course title
4. Click the Resources tab
5. Click a file to download it

Hands-On Exercise: Using Your Exercise Environment
▪ In this exercise, you will learn to access and work in your exercise
environment
▪ Please refer to the Hands-On Exercise Manual for instructions

Introduction to Cloudera Flow Management
Chapter 2
Introduction to Cloudera Flow Management
After completing this chapter, you will be able to
▪ Describe how flow management fits into an enterprise data solution
▪ Summarize how Cloudera Flow Management uses Apache NiFi to manage
dataflow
▪ Explain the major areas of the NiFi web user interface

Chapter Topics

Introduction to Cloudera Flow Management


▪ Overview of Cloudera Flow Management and NiFi
▪ The NiFi User Interface
▪ Instructor-Led Demonstration: NiFi User Interface
▪ Hands-On Exercise: Build Your First Dataflow
▪ Essential Points

What Is Dataflow?
▪ The automated and managed flow of information between systems
(Producers and Consumers)
▪ Dataflow challenges
─ Scalable for large volumes of data
─ Maintainable
─ Data security and governance
 

Cloudera Data Platform
▪ A suite of products to collect, curate, report, serve, and predict
 

Cloudera DataFlow

▪ Cloudera DataFlow (CDF) is an Enterprise Data-in-Motion platform
─ Ingests, curates, and analyzes data for key insights and immediate actionable
intelligence
─ Scalable, real-time streaming analytics platform
 

Edge and Flow Management

▪ Edge and Flow Management is part of CDF
─ Powered by Apache NiFi
─ No-code data ingestion and management solution
─ Flow-based programming model

What Is Apache NiFi?
▪ A system to process and distribute data
─ Automates the flow of data within or between systems
─ Provides a web-based UI for creating, monitoring, and controlling dataflow
─ Visual programming paradigm allows no-code implementation

A Brief History of NiFi

Why Use Cloudera Flow Management and NiFi?
▪ Runtime configuration of the flow of data
▪ Detailed history of each data item through entire flow
▪ Extensible through development of custom components
▪ Secure communication with other NiFi instances and external systems

NiFi Key Features (1)
▪ Guaranteed delivery
▪ Data buffering with back pressure and pressure release
▪ Control quality of service for different flows
─ Balancing latency against throughput
─ Loss tolerance
▪ Data provenance
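The back pressure idea listed above can be illustrated with a toy bounded queue. This is a minimal sketch, not NiFi code: the class name `BoundedConnection` and the `object_threshold` parameter are hypothetical, and the sketch simplifies by refusing new items, whereas NiFi itself applies back pressure by holding back the upstream component.

```python
from collections import deque


class BoundedConnection:
    """Toy queue that signals back pressure once a threshold is reached."""

    def __init__(self, object_threshold):
        # Hypothetical setting: maximum number of queued items before
        # back pressure is engaged.
        self.object_threshold = object_threshold
        self._queue = deque()

    def back_pressure_engaged(self):
        return len(self._queue) >= self.object_threshold

    def offer(self, item):
        """Enqueue the item unless back pressure is engaged."""
        if self.back_pressure_engaged():
            return False
        self._queue.append(item)
        return True

    def poll(self):
        """Remove and return the oldest item, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None


conn = BoundedConnection(object_threshold=2)
accepted = [conn.offer(name) for name in ("a", "b", "c")]

# Draining one item releases the pressure, so the producer can continue.
conn.poll()
accepted_after_drain = conn.offer("c")
```

The third item is refused while the queue is full and accepted once a consumer has drained one entry, which is the buffering behavior the bullet describes.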

NiFi Key Features (2)
▪ Recovery/recording with a rolling log of fine-grained history (provenance)
▪ Visual command and control
▪ Flow templates
▪ Security and multi-tenant authorization
▪ Clustering

NiFi in Context

[Figure: Apache NiFi shown alongside four tool categories: Processing Framework
(Flink, Spark), Integration/Ingestion Framework (Camel, Flume), ETL (Informatica),
and Messaging Bus (Kafka, JMS)]

NiFi and Processing Frameworks

Apache NiFi
▪ Simple event processing
▪ Scales up to take advantage of better hardware/more resources
▪ Feeds data to data processing frameworks

Processing frameworks (Flink, Spark)
▪ More complex data processing
─ Joining data from multiple streams
─ Analyzing data by time windows
▪ Can scale out to thousands of nodes
▪ Not designed to collect data or manage dataflow

NiFi and Messaging Bus Services

Apache NiFi
▪ Centralized management, from edge to core
▪ Traceability through data provenance
▪ Interactive command and control
▪ Dataflow management such as prioritization and back pressure
▪ Visual representation of global dataflow

Messaging bus services (Kafka, JMS)
▪ Low latency
▪ Great data durability
▪ Decentralized management (producers and consumers)
▪ Low broker maintenance for dynamic consumer-side updates

NiFi and Integration/Ingestion Frameworks

Apache NiFi
▪ Dataflow management tool
▪ End-user facing
▪ Out-of-the-box solution
▪ Visual representation of dataflow
▪ Interactive design and management

Integration/Ingestion frameworks (Camel, Flume)
▪ Integration tool focused on ingestion
▪ Developer facing
▪ A set of tools to orchestrate workflow
▪ Fixed design and deploy pattern
▪ Custom code needed to optimize

NiFi and ETL Tools

NiFi
▪ For structured or unstructured data
▪ Can use schema for structured data
▪ Minimal data modeling effort required

Extract/Transform/Load tools (Informatica)
▪ Structured data only; requires schema
▪ Designed for databases/data warehouses
▪ ETL operations based on data modeling
▪ Highly efficient, optimized performance
▪ Does not address dataflow problems

Key NiFi Concepts
▪ FlowFile—unit of data (with associated metadata and attributes) moving
through a dataflow
▪ Processor—performs work on FlowFiles
▪ Connection—links Processors within a dataflow
▪ Dataflow—collection of Processors and Connections to distribute FlowFiles
▪ Process Group—set of Processors and their Connections
▪ Provenance—a record of what has happened to a FlowFile
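The concepts listed above can be made concrete with a small toy model. It is only a sketch to show how the pieces relate and does not use any NiFi API: `FlowFile`, `uppercase_content`, and `run_processor` are hypothetical names, a plain `deque` stands in for a Connection, and a list of tuples stands in for provenance records.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Unit of data: content plus a dictionary of attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


def uppercase_content(flowfile):
    """A 'Processor' function: returns a new FlowFile with modified content."""
    return FlowFile(flowfile.content.upper(), dict(flowfile.attributes))


def run_processor(work, incoming, outgoing, provenance):
    """Drain the incoming 'Connection', apply the work, and record what happened."""
    while incoming:
        flowfile = incoming.popleft()
        outgoing.append(work(flowfile))
        provenance.append((work.__name__, flowfile.attributes.get("filename")))


# Two queues play the role of Connections on either side of the Processor.
incoming = deque([FlowFile(b"hello", {"filename": "greeting.txt"})])
outgoing = deque()
provenance = []

run_processor(uppercase_content, incoming, outgoing, provenance)
```

After the run, the FlowFile sits in the outgoing queue with transformed content and unchanged attributes, and the provenance list holds one record naming the processor and the file it handled.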
 
[Figure: example dataflow with callouts for Connector, Dataflow, FlowFiles, and
Processor]

The NiFi Web User Interface
[Figure: NiFi web UI with callouts for the Component Toolbar, Global Menu, Status
Bar, Search, Navigate Palette, Operate Palette, Canvas, and Process Group
Breadcrumbs]

Component Toolbar
▪ Create a dataflow by dragging dataflow components from the toolbar onto
the canvas
[Figure: toolbar icons, left to right: Processor, Input port, Output port, Process
group, Remote process groups, Funnel, Template, Label]

Operate Palette
▪ Manipulate existing dataflow components using the Operate palette
[Figure: Operate palette buttons: Configure, Enable/Disable, Start/Stop,
Create/Upload flow template, Copy/Paste, Group components, Change color, Delete]

Search Field
▪ Quickly find components by name, type, property values, and so on
 

Global Menu

▪ Manage user access, set system properties, and so on

UI Navigation
▪ The navigation palette lets you zoom in and out, and adjust the view of the
canvas
▪ Breadcrumbs let you navigate between the root canvas and Process Groups
(including nested Process Groups)
 

[Figure: canvas with callouts for the Navigate Panel, Bird's-eye view, and Process
Group Breadcrumbs]

Hands-On Exercise: Build Your First Dataflow
▪ In this exercise, you will build a simple dataflow
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ Cloudera DataFlow (CDF) is part of the Cloudera Data Platform (CDP)
─ Enterprise grade solution for data ingestion and streaming
─ Cloudera uses Apache NiFi for Edge and Flow Management
▪ Apache NiFi
─ Dataflow builder, no code required
─ Guaranteed delivery
─ Data provenance
▪ NiFi elements
─ FlowFiles
─ Processors
─ Connections
─ Dataflows
─ Process Groups

Processors
Chapter 3
Processors
After completing this chapter, you will be able to
▪ Explain what a Processor is
▪ Demonstrate how to add a Processor to your Apache NiFi canvas
▪ Interpret the information on the Processor surface panel
▪ Configure a Processor

Chapter Topics

Processors
▪ Overview of Processors
▪ Processor Surface Panel
▪ Processor Configuration
▪ Hands-On Exercise: Start Building a Dataflow Using Processors
▪ Essential Points

What is a Processor?
▪ A Processor performs work on FlowFiles
▪ Processors can
─ Poll for incoming FlowFiles
─ Pull data from external sources
─ Publish data to external sources
─ Route, transform, or extract information from FlowFiles
▪ A dataflow consists of a series of Processors joined by Connections for specific
relationships

Over 300 Processors
 

Types of Processor Functions
▪ Core functions
─ Data Ingestion
─ Data Egress
─ Data Transformation
─ FlowFile Attributes
─ Control
▪ Additional functions
─ Routing and Mediation
─ Database Access
─ System Interaction
─ Web Services (HTTP and AWS)

Examples of Data Ingestion Processors
▪ FetchFile: Streams the contents of a file from a local disk into NiFi (then
optionally deletes or moves the file)
▪ ListHDFS: Monitors a directory in HDFS and emits a FlowFile for each file, with
the file name carried as an attribute
▪ FetchHDFS: On receiving a FlowFile from ListHDFS, fetches the content of the
corresponding file from HDFS into NiFi
▪ FetchFTP / FetchSFTP: Downloads the contents of a remote file via FTP/
SFTP into NiFi
▪ ConsumeKafka: Receives messages from Apache Kafka
▪ GetTwitter: Listens to a Twitter endpoint, optionally with a filter, and creates a
FlowFile for each tweet that is received
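The ListHDFS and FetchHDFS pair follows a list-then-fetch pattern: one step enumerates files, a second step reads each one. The sketch below imitates that pattern against a local temporary directory rather than HDFS. The helper names `list_files` and `fetch_file` are hypothetical, and it assumes the listing step passes the file location along as attributes.

```python
import os
import tempfile


def list_files(directory):
    """Listing step: emit one attribute dictionary per file, without content."""
    return [
        {"path": directory, "filename": name}
        for name in sorted(os.listdir(directory))
    ]


def fetch_file(attributes):
    """Fetch step: read the content of the file that a listing entry points to."""
    location = os.path.join(attributes["path"], attributes["filename"])
    with open(location, "rb") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as workdir:
    # Create two sample files to stand in for files in a monitored directory.
    for name, body in (("a.txt", b"alpha"), ("b.txt", b"beta")):
        with open(os.path.join(workdir, name), "wb") as handle:
            handle.write(body)

    listed = list_files(workdir)
    fetched = [fetch_file(item) for item in listed]
```

Splitting the work this way keeps the listing cheap and lets the heavier fetch step be distributed or throttled independently.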

Data Egress Processors
▪ PutEmail: Sends an email to the configured recipients
▪ PutFTP: Copies the contents of a FlowFile to a remote FTP server
▪ PutSFTP: Copies the contents of a FlowFile to a remote SFTP server
▪ PutSQL: Executes the contents of a FlowFile as a SQL DML statement
(INSERT or UPDATE)
▪ PublishKafka: Sends the contents of a FlowFile to Kafka as a message
▪ PutMongo: Sends the contents of a FlowFile to MongoDB as an INSERT or an
UPDATE

Examples of Data Transformation Processors
▪ CompressContent: Compress or decompress content
▪ ConvertCharacterSet: Convert the content from one character set to
another
▪ EncryptContent: Encrypt or decrypt content
▪ ReplaceText: Use regular expressions to modify textual content
▪ TransformXml: Apply an XSLT transform to XML content
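Two of these transformations are easy to picture with standard-library calls. The sketch below is an analogy only: it uses Python's `re` and `gzip` modules rather than NiFi, Python's regular-expression dialect differs in places from the Java one NiFi uses, and the helper names and sample text are made up for illustration.

```python
import gzip
import re


def replace_text(content, pattern, replacement):
    """ReplaceText-style step: rewrite every match of a regular expression."""
    return re.sub(pattern, replacement, content)


def compress_content(content):
    """CompressContent-style step: gzip the raw bytes."""
    return gzip.compress(content)


def decompress_content(content):
    """Inverse of compress_content."""
    return gzip.decompress(content)


# Mask a digit group in some sample text.
masked = replace_text("card=1234-5678", r"\d{4}-\d{4}", "****-****")

# Compression changes the bytes; decompression restores them exactly.
original = b"sample payload " * 10
round_tripped = decompress_content(compress_content(original))
```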

Examples of Routing and Mediation Processors
▪ ControlRate: Throttle the rate at which FlowFiles can flow through one part
of the flow
▪ DistributeLoad: Load balance by distributing only a portion of data to each
user-defined relationship
▪ MonitorActivity: Sends a notification when a user-defined period of time
elapses without any data
▪ RouteOnAttribute: Routes a FlowFile based on the attributes that it contains
▪ ScanAttribute: Scans a user-defined set of attributes on a FlowFile for terms that
are present in a user-defined dictionary
▪ RouteOnContent: Searches the content of a FlowFile and, on a match, routes it to
the configured relationship
▪ ScanContent: Searches the content of a FlowFile for terms that are present in a
user-defined dictionary
▪ ValidateXml: Validates XML content against an XML schema
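Attribute-based routing can be pictured as a set of named rules evaluated against a FlowFile's attributes. The following is a minimal sketch of that idea, not the processor's implementation: in NiFi the rules are written in the Expression Language, whereas here they are plain Python predicates, and the rule names `large` and `csv` are invented for the example.

```python
def route_on_attribute(attributes, rules):
    """Return the names of all rules that match, or 'unmatched' if none do."""
    matched = [name for name, predicate in rules.items() if predicate(attributes)]
    return matched or ["unmatched"]


# Hypothetical routing rules keyed by relationship name.
rules = {
    "large": lambda attrs: int(attrs.get("fileSize", 0)) > 1024,
    "csv": lambda attrs: attrs.get("filename", "").endswith(".csv"),
}

big_csv = route_on_attribute({"filename": "data.csv", "fileSize": "2048"}, rules)
small_json = route_on_attribute({"filename": "x.json", "fileSize": "10"}, rules)
```

A FlowFile that satisfies several rules is sent to each matching relationship, and one that satisfies none falls through to `unmatched`.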

Examples of Database Access Processors
▪ ConvertJSONToSQL: Convert a JSON document into a SQL INSERT or
UPDATE command
▪ ExecuteSQL: Executes a user-defined SQL SELECT command, writing the
results in Avro format
▪ PutSQL: Updates a database by executing the SQL DML statement defined by
the FlowFile’s content
▪ GetHBase: Polls HBase for any records in the specified table
▪ PutHBaseCell: Adds the contents of a FlowFile to HBase as the value of a
single cell
▪ PutHBaseJSON: Adds rows to HBase based on the contents of incoming JSON
documents
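The JSON-to-SQL step followed by a statement-executing step can be sketched with the standard library. This is an illustration under stated assumptions rather than NiFi behavior: `json_to_insert` is a hypothetical helper, the table name is assumed to be trusted, only flat JSON objects are handled, and an in-memory SQLite database stands in for the target system.

```python
import json
import sqlite3


def json_to_insert(document, table):
    """Turn a flat JSON object into a parameterized INSERT and its values."""
    record = json.loads(document)
    columns = sorted(record)
    placeholders = ", ".join("?" for _ in columns)
    # The table and column names are interpolated, so they must be trusted;
    # the values are passed separately as parameters.
    statement = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return statement, [record[column] for column in columns]


statement, params = json_to_insert('{"id": 1, "name": "sensor-a"}', "devices")

# Execute the generated statement, as a PutSQL-style step would.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE devices (id INTEGER, name TEXT)")
connection.execute(statement, params)
rows = connection.execute("SELECT id, name FROM devices").fetchall()
connection.close()
```

Keeping the values out of the SQL text avoids quoting problems and injection through the data.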

Examples of Attribute Extraction Processors
▪ EvaluateJsonPath: Evaluates user-supplied JSONPath expressions against the
JSON content of a FlowFile
▪ ExtractText: Extracts text from the content of a FlowFile using regular expressions
▪ HashAttribute: Performs a hashing function against the concatenation of
existing attributes
▪ HashContent: Performs a hashing function against the content of a FlowFile
and adds the result as an attribute
▪ IdentifyMimeType: Evaluates the content of a FlowFile to determine the
MIME type of the file
▪ UpdateAttribute: Adds or updates any number of user-defined attributes
to a FlowFile
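Attribute extraction means deriving small values from the content and storing them alongside it. The sketch below shows the idea with a top-level JSON key lookup and a content hash. It is not NiFi code: the helper names and the attribute names `device` and `content.sha256` are hypothetical, and the key lookup is far simpler than a real JSONPath evaluation.

```python
import hashlib
import json


def evaluate_json_key(content, key):
    """Simplified stand-in for a JSONPath lookup: read one top-level key."""
    return json.loads(content).get(key)


def hash_content(content, algorithm="sha256"):
    """Return the hex digest of the raw content bytes."""
    return hashlib.new(algorithm, content).hexdigest()


content = b'{"device": "sensor-a", "reading": 21.5}'
attributes = {"filename": "reading.json"}

# Copy the attributes and add the extracted values; the content is untouched.
updated = dict(attributes)
updated["device"] = evaluate_json_key(content, "device")
updated["content.sha256"] = hash_content(content)
```

Once the values live in attributes, later steps can route or filter on them without parsing the content again.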

Examples of System Interaction Processors
▪ ExecuteProcess: A Source Processor that runs a user-defined operating system
command
▪ ExecuteStreamCommand: Runs a user-defined operating system
command and must be fed incoming FlowFiles in order to do its work
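The difference between the two processors is whether the command receives input. A rough analogy using Python's `subprocess` module is sketched below. The function names are hypothetical, and the Python interpreter itself is used as the external command so the example does not depend on a particular shell.

```python
import subprocess
import sys


def execute_process(command):
    """Source-style step: run a command with no input and capture its output."""
    return subprocess.run(command, capture_output=True, check=True).stdout


def execute_stream_command(command, flowfile_content):
    """Stream-style step: pipe existing content to the command's standard input."""
    completed = subprocess.run(
        command, input=flowfile_content, capture_output=True, check=True
    )
    return completed.stdout


# A command that produces data on its own.
source_output = execute_process([sys.executable, "-c", "print('tick')"])

# A command that transforms whatever content it is given.
piped_output = execute_stream_command(
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    b"hello",
)
```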

Examples of HTTP Processors
▪ InvokeHTTP: Performs an HTTP request that is configured by the user (such
as GET or POST)
▪ HandleHttpRequest: A Source Processor that starts an HTTP(S) server
(similar to ListenHTTP)
▪ HandleHttpResponse: Sends a response back to the client after the
FlowFile has finished processing

Examples of Amazon Web Services Processors
▪ FetchS3Object: Fetches the content of an object stored in Amazon Simple
Storage Service (S3)
▪ PutS3Object: Writes the contents of a FlowFile to an Amazon S3 object as
configured
▪ PutSNS: Sends the contents of a FlowFile as a notification to the Amazon
Simple Notification Service (SNS)
▪ GetSQS: Pulls a message from the Amazon Simple Queue Service (SQS) and
writes it to a FlowFile
▪ PutSQS: Sends the contents of a FlowFile as a message to the Amazon Simple
Queue Service (SQS)
▪ DeleteSQS: Deletes a message from the Amazon Simple Queue Service
(SQS)

Adding a Processor (1)
▪ Add a Processor by dragging the Processor icon from the toolbar to the canvas
 

Adding a Processor (2)
▪ NiFi prompts for Processor type when adding a Processor
▪ Find the Processor you want by
─ Filtering by category
─ Searching by type name
─ Choosing source
 

Processor UI: Surface Panel Overview
▪ The surface panel is visible for components on the canvas
 
[Figure: Processor surface panel with callouts for Processor name, Processor type, status indicator, bulletin indicator, Active Tasks, and 5-minute statistics]

Processor Name and Type
Processor Name
▪ The user-defined name of the Processor
▪ By default, the name of the Processor is the same as the Processor type
 
Processor Type
▪ Shows the type of the Processor, such as FetchFile
 
Bulletin Indicator
▪ Icon only appears when a bulletin exists for this Processor
▪ Hover over indicator to see Processor messages such as warnings and errors
that have occurred in the last five minutes
─ You can also see bulletins on the global bulletin board
▪ You can configure which type of bulletins should be displayed
─ The default value is WARN (displays warnings and errors but not INFO
messages)
▪ If the instance of NiFi is clustered, it will also show the node that emitted the
bulletin
 

Active Tasks
▪ The number of tasks that this Processor is currently executing
▪ Number is constrained by the Concurrent tasks setting in Processor
configuration
▪ In the example, the Processor is currently performing one task
▪ If the NiFi instance is clustered, this value is the cumulative number of tasks
executing across all nodes in the cluster
 

Status Indicator
▪ Shows the current status of the Processor
─ Running: The Processor is currently running
─ Stopped: The Processor is valid and enabled but is not running
─ Invalid: The Processor is enabled but the configuration is not valid and it
cannot be started
─ Hover over icon to see why the configuration is not valid
─ Disabled: The Processor is not running and cannot be started until it has
been enabled
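
The four statuses can be read as a small decision rule. The sketch below is only an illustration of that rule under the definitions above; the function name and its three boolean inputs are assumptions, not part of NiFi.

```python
def processor_status(enabled, valid, running):
    """Derive the displayed status from three hypothetical flags."""
    if not enabled:
        return "Disabled"  # cannot be started until it is enabled
    if not valid:
        return "Invalid"   # enabled, but the configuration is not valid
    return "Running" if running else "Stopped"


if __name__ == "__main__":
    print(processor_status(enabled=True, valid=True, running=False))
```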

5-Minute Statistics
▪ The Processor shows statistics representing the last five minutes
▪ If the NiFi instance is clustered, the statistics are combined across all nodes
▪ These metrics are
─ In: The number of FlowFiles and size of data pulled from the queues of its
incoming Connections
─ Read/Write: The total size of the FlowFile content read/written to disk
─ Out: The number of FlowFiles and size of data transferred to its outbound
Connections
─ Tasks/Time: The number of tasks and cumulative time the tasks took
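
One way to picture a five-minute statistic is as a rolling window over recent events. The sketch below is a simplified model of that idea, not how NiFi computes its metrics; the class name, method names, and the use of plain numeric timestamps in seconds are assumptions.

```python
from collections import deque


class FiveMinuteStats:
    """Rolling count of FlowFiles and bytes seen in the last 300 seconds."""

    WINDOW_SECONDS = 300

    def __init__(self):
        self.events = deque()  # (timestamp, flowfile_count, byte_count)

    def record(self, timestamp, flowfile_count, byte_count):
        self.events.append((timestamp, flowfile_count, byte_count))

    def snapshot(self, now):
        # Discard events that have fallen out of the window
        while self.events and self.events[0][0] <= now - self.WINDOW_SECONDS:
            self.events.popleft()
        total_files = sum(e[1] for e in self.events)
        total_bytes = sum(e[2] for e in self.events)
        return total_files, total_bytes


if __name__ == "__main__":
    stats = FiveMinuteStats()
    stats.record(0, 1, 100)
    stats.record(200, 2, 50)
    print(stats.snapshot(250))
```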
 

Processor UI: Processor Configuration Panel
▪ Three ways to open a component’s configuration panel
─ Select the component and click the Configure button on the Operate palette
─ Right-click the component and choose Configure from the context menu
─ Double-click the component
▪ Component must be in a stopped state to modify configuration
 

Processor Configuration Panel: SETTINGS Tab

Processor Configuration Panel: SCHEDULING Tab
 

Processor Configuration Panel: PROPERTIES Tab
▪ The PROPERTIES tab provides a mechanism to configure Processor-specific
behavior
─ Different Processors have different properties
─ Some support dynamically created properties
 

Processor Configuration Panel: COMMENTS Tab
▪ The COMMENTS tab provides an opportunity to document your Processor to
help in future maintenance
▪ Comments are optional
 

Hands-On Exercise: Start Building a Dataflow Using Processors
▪ In this exercise, you will start building a dataflow by placing and configuring
processors
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ What are Processors in NiFi?
─ Elements that do work in NiFi
─ Poll for incoming FlowFiles
─ Pull data from external sources
─ Publish data to external sources
─ Route, transform, or extract information from FlowFiles
▪ How do you use a Processor?
─ Click and drag the Processor icon from the toolbar onto the canvas
─ Use the filter tools to find the Processor you need
─ Configure the Processor using the configuration panel
─ Use the Processor surface panel to control and monitor the Processor

Connections
Chapter 4

Connections
After completing this chapter, you will be able to
▪ Define what is meant by a Connection in Apache NiFi
▪ Describe Connection relationships
▪ View and clear queues in NiFi

Chapter Topics

Connections
▪ Overview of Connections
▪ Connection Configuration
▪ Connector Context Menu
▪ Hands-On Exercise: Connect Processors in a Dataflow
▪ Essential Points

Connection Overview
▪ Connections connect components in a dataflow
─ Define the path a FlowFile will take
▪ Connections queue (buffer) FlowFiles and pass them to the downstream
Processor
 

Connections and Relationships
▪ Processors choose which Connection to pass a FlowFile to using relationships
─ For example, many Processors have a success and a failure relationship
 

Creating a Connection
▪ Hover pointer over a Processor to reveal the Connection icon

▪ Drag the icon from one component to another until the second component is
highlighted
▪ The Create Connection dialog appears
 

Bending Connections
▪ Add a bend point (or elbow) to organize and neaten canvas
▪ Double-click on the Connection to add a bend
▪ Add any number of bend points
▪ To remove a bend point, double-click it again
 

Loop Back for Failed FlowFiles
▪ Add a Connection that loops back to the same Processor to re-process
FlowFiles that fail
▪ Drag the Connection icon away and then back to the same Processor
 

Configuring Connections: DETAILS Tab
▪ The DETAILS tab provides information about a Connection
▪ Every Connection must consist of one or more relationships
▪ If multiple Connections from a Processor have the same relationship, FlowFiles
for that relationship are automatically cloned
─ A copy is sent to each of the Connections
 

Configuring Connections: SETTINGS Tab
▪ The SETTINGS tab provides the ability to configure the Connection’s name,
FlowFile expiration, back pressure thresholds, and prioritization
 

FlowFile Expiration
▪ FlowFile expiration specifies that a FlowFile that cannot be processed in a
timely fashion should be automatically deleted
▪ The expiration period is based on the time that the data entered the NiFi
instance
▪ The default value of 0 sec indicates that FlowFiles will never expire
▪ When a FlowFile expiration period is set, a small clock icon appears on the
Connection label
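
The expiration rule described above can be sketched in a few lines. This is a simplified model rather than NiFi's implementation; the function names, the `entered_at` field, and the use of numeric timestamps in seconds are assumptions for the example.

```python
def is_expired(entered_at, now, expiration_seconds):
    """A FlowFile expires once its age since entering NiFi exceeds the period."""
    if expiration_seconds == 0:
        return False  # the default of 0 sec means FlowFiles never expire
    return (now - entered_at) > expiration_seconds


def drop_expired(queue, now, expiration_seconds):
    """Return the queue without the FlowFiles that have expired."""
    return [
        ff for ff in queue
        if not is_expired(ff["entered_at"], now, expiration_seconds)
    ]


if __name__ == "__main__":
    queue = [{"id": "old", "entered_at": 0}, {"id": "new", "entered_at": 90}]
    print(drop_expired(queue, now=100, expiration_seconds=60))
```

Note that age is measured from the time the data entered the NiFi instance, not from the time the FlowFile reached this particular queue.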
 

List Queue
▪ The List Queue option allows you to view a list of queued FlowFiles
─ Only shows the first 100 FlowFiles
 

Empty Queue
▪ The Empty queue option allows you to delete all FlowFiles in a queue
─ Useful during testing
─ Upstream and downstream Processors must be in a stopped state

Hands-On Exercise: Connect Processors in a Dataflow
▪ In this exercise, you will complete building a dataflow by adding connections
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ What are Connections in NiFi?
─ Connections connect components in a dataflow
─ Define the path a FlowFile will take
─ Click and drag the Connection icon between Processors to create a
Connection
▪ What are Connection relationships?
─ Processors choose which Connection to pass a FlowFile to using relationships
─ Relationships are named to indicate the result of processing a FlowFile
─ The most common relationships are "Success" and "Failure"
▪ What are Connection queues?
─ NiFi can buffer FlowFiles using queues
─ Tools including FlowFile expiration, back pressure, and prioritization can be
used to manage FlowFiles in queues

Dataflows
Chapter 5

Dataflows
After completing this chapter, you will be able to
▪ Manage and use Processor states in a dataflow
▪ Build a dataflow using Processors and Connections
▪ Create a flow that demonstrates back pressure
▪ Describe queue prioritization

Chapter Topics

Dataflows
▪ Command and Control of a Dataflow
▪ Processor Relationships
▪ Back Pressure
▪ Prioritizers
▪ Labels
▪ Hands-On Exercise: Build a More Complex Dataflow
▪ Hands-On Exercise: Creating a Fork Using Relationships
▪ Hands-On Exercise: Set Back Pressure Thresholds
▪ Essential Points

Starting and Stopping Processors
▪ New Processors are initially in a stopped state
─ You must start a Processor to start processing FlowFiles
─ You can only start enabled Processors with valid configurations
▪ Once started, you can stop the Processor at any time
─ Active tasks will be completed before Processor stops
▪ Before configuring, disabling, or deleting a component, you must stop it
▪ Stopping a Process Group stops all Processors in the group
 

Enabling and Disabling Components
▪ By default, Processors are enabled
▪ You might wish to disable components when they are part of a dataflow that
is still being assembled
─ Helps distinguish components that are intentionally out of use from those
that are only stopped temporarily
▪ You must enable a disabled component before it can be started
─ Enable a component by clicking the enable icon in the Operate palette, or on
the SETTINGS tab of its configuration panel
 

[Figure: callouts for the Enable processor and Disable processor icons]

Processor State Management (1)
▪ Most types of Processors are stateless
─ New tasks do not require information about prior processing
▪ A few Processors are stateful
─ State information is preserved across tasks
▪ Example: TailFile tasks store the last location read from the file
─ Next task can read from location where prior task left off
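
The following sketch illustrates what "state preserved across tasks" means for a TailFile-style Processor. It is a simplified model, not NiFi's implementation: the `tail_task` function, the `state` dictionary, and the `position` key are assumptions for the example.

```python
import os
import tempfile


def tail_task(path, state):
    """Read whatever was appended since the previous task and remember the offset."""
    offset = state.get("position", 0)  # no stored state means start of file
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
        state["position"] = f.tell()   # preserved for the next task
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        log_path = os.path.join(directory, "app.log")
        with open(log_path, "wb") as f:
            f.write(b"line 1\n")
        state = {}
        print(tail_task(log_path, state))  # reads the first line
        with open(log_path, "ab") as f:
            f.write(b"line 2\n")
        print(tail_task(log_path, state))  # reads only the appended line
```

Clearing the stored state (here, emptying the dictionary) makes the next task read the file from the beginning again.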

Processor State Management (2)
▪ Stateful Processors maintain state throughout the dataflow lifecycle
▪ You may need to view or clear a Processor’s state for testing or
troubleshooting
─ Example: you have modified a dataflow fed by TailFile and need to test
it on the entire file
▪ You must stop a Processor before clearing its state
▪ Use the Processor context menu View State option
 

Building a Flow
1. Add and configure Processors
2. Add and configure Connections
3. Start Processors
 
[Figure: the steps in sequence: add processors, add connections, configure connections, start the processors]

Relationships and Connections
▪ Every type of Processor sends each FlowFile through one or more routing
streams called Relationships
─ Every outgoing FlowFile is assigned to a Relationship after processing
─ Relationships determine which Processor(s) a FlowFile passes to next
▪ Connections to multiple receiving Processors can be created from a single
Relationship
─ FlowFile will be cloned
─ A copy will be sent to each of those Connections
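
The cloning behavior can be modeled as follows. This is an illustrative sketch only; the `route` function and the dictionary shapes used for FlowFiles and Connections are assumptions, not NiFi APIs.

```python
import copy


def route(flowfile, relationship, connections):
    """Place the FlowFile on every Connection that carries the given Relationship.

    The first matching Connection receives the original; each additional
    Connection receives an independent copy (a clone).
    """
    targets = [c for c in connections if relationship in c["relationships"]]
    for index, connection in enumerate(targets):
        item = flowfile if index == 0 else copy.deepcopy(flowfile)
        connection["queue"].append(item)
    return len(targets)


if __name__ == "__main__":
    connections = [
        {"relationships": {"success"}, "queue": []},
        {"relationships": {"success"}, "queue": []},
        {"relationships": {"failure"}, "queue": []},
    ]
    delivered = route({"content": b"data", "attributes": {}}, "success", connections)
    print(delivered, [len(c["queue"]) for c in connections])
```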

Relationships and Connections (2)
▪ Every Relationship must either be auto-terminated or have a Connection to at
least one other Processor
─ Auto-terminated Relationships stop the flow of FlowFiles
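
The rule that every Relationship must be either connected or auto-terminated lends itself to a simple check. The sketch below is a hypothetical helper written for illustration; NiFi performs its own validation, and the function name and argument shapes here are assumptions.

```python
def unhandled_relationships(relationships, auto_terminated, connections):
    """Return the Relationships that are neither auto-terminated nor connected.

    `connections` is a list of sets, one per outgoing Connection, each holding
    the Relationship names assigned to that Connection.
    """
    connected = set()
    for assigned in connections:
        connected |= set(assigned)
    return set(relationships) - set(auto_terminated) - connected


if __name__ == "__main__":
    # "failure" is neither connected nor auto-terminated, so it is reported
    print(unhandled_relationships({"success", "failure"}, set(), [{"success"}]))
```

An empty result corresponds to a Processor whose Relationships are all handled.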
 

Relationships and Connections (3)
▪ Connections are configured with Relationships from the upstream Processor
─ Connections may have one or more Relationships assigned
 

Back Pressure
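[Figure: panels A, B, and C, each showing Processors P1 and P2 and the Connection between them, with the back pressure threshold marked; legend: FlowFile, Connection, Processor]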

Back Pressure Thresholds
▪ Thresholds determine how much data or how many FlowFiles a queue can
contain before back pressure is applied
─ Object threshold: the number of FlowFiles (default is 10000)
─ Size threshold: the total size of data in the queue (default is 1 GB)
▪ Exceeding thresholds will not prevent the feeding Processor from completing a
running task
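
A minimal model of the two thresholds is sketched below. It is not NiFi's implementation: the class and function names are assumptions, 1 GB is taken to mean 1024³ bytes, and back pressure is treated as applied once either threshold is reached. The model also shows why a queue can end up above its threshold: a task that has already started is allowed to finish.

```python
class ConnectionQueue:
    """Hypothetical queue that tracks the size in bytes of each queued FlowFile."""

    def __init__(self, object_threshold=10_000, size_threshold_bytes=1024 ** 3):
        self.object_threshold = object_threshold
        self.size_threshold_bytes = size_threshold_bytes
        self.sizes = []

    def enqueue(self, size_bytes):
        self.sizes.append(size_bytes)

    def dequeue(self):
        return self.sizes.pop(0)

    def back_pressure_applied(self):
        return (len(self.sizes) >= self.object_threshold
                or sum(self.sizes) >= self.size_threshold_bytes)


def run_upstream_task(queue, output_sizes):
    """Run one upstream task unless back pressure is applied when it would start."""
    if queue.back_pressure_applied():
        return False  # the upstream Processor is not scheduled
    for size in output_sizes:
        queue.enqueue(size)  # a started task completes, even past the threshold
    return True


if __name__ == "__main__":
    q = ConnectionQueue(object_threshold=3, size_threshold_bytes=100)
    print(run_upstream_task(q, [10, 10]), run_upstream_task(q, [10, 10]))
    print(run_upstream_task(q, [10]), len(q.sizes))
```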
 

Back Pressure Indicators
▪ When back pressure is enabled, small progress bars appear on the Connection
label
▪ The progress bars change color based on the queue percentage
─ Green (0-60%)
─ Yellow (61-85%)
─ Red (86-100%)
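
The color bands above amount to a simple mapping from queue percentage to color. The function below is an illustrative sketch; its name is an assumption, and how fractional percentages between bands are treated is a choice made for the example.

```python
def indicator_color(queue_percent):
    """Map how full a queue is (as a percentage of its threshold) to a bar color."""
    if queue_percent <= 60:
        return "green"
    if queue_percent <= 85:
        return "yellow"
    return "red"


if __name__ == "__main__":
    for percent in (10, 61, 90):
        print(percent, indicator_color(percent))
```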
 

Queue Prioritization
▪ Prioritizers control which FlowFiles in a queue are processed first
▪ Determine what is important for your data
─ time based
─ arrival order
─ importance of a dataset
▪ Choose a prioritizer for each Connection
▪ Funnel many Connections down to a single Connection to prioritize across
datasets
▪ Develop your own prioritizer if needed

Types of Prioritizers
▪ FirstInFirstOutPrioritizer
▪ NewestFlowFileFirstPrioritizer
▪ OldestFlowFileFirstPrioritizer
─ This is the default scheme if no prioritizers are selected
▪ PriorityAttributePrioritizer
─ Given two FlowFiles that both have a priority attribute, the one that has
the lowest priority value (that is, the highest priority) will be processed first

Configuring Queue Prioritization

▪ Choose a prioritizer by dragging from the top (Available Prioritizers) to the
bottom (Selected Prioritizers)
▪ Multiple prioritizers can be selected
▪ The prioritizer at the top of the Selected Prioritizers list is the highest priority
─ If two FlowFiles have the same value according to this prioritizer, the second
prioritizer will determine which FlowFile to process first, and so on
▪ Remove a prioritizer by dragging it to the Available Prioritizers list
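
Chaining prioritizers behaves like sorting on a composite key: the first selected prioritizer decides, and later ones only break ties. The sketch below illustrates this with two hypothetical key functions. It is not NiFi's prioritizer interface; the FlowFile fields are assumptions, and the example assumes that a lower numeric `priority` value is processed first.

```python
def oldest_flowfile_first(flowfile):
    # Smaller timestamp means the FlowFile has been in the dataflow longer
    return flowfile["lineage_start"]


def priority_attribute(flowfile):
    # Assumption for this sketch: lower number is processed first
    return int(flowfile["attributes"]["priority"])


def order_queue(queue, selected_prioritizers):
    """Sort by the first prioritizer, using each later one only to break ties."""
    return sorted(
        queue,
        key=lambda ff: tuple(p(ff) for p in selected_prioritizers),
    )


if __name__ == "__main__":
    queue = [
        {"id": "a", "lineage_start": 30, "attributes": {"priority": "2"}},
        {"id": "b", "lineage_start": 20, "attributes": {"priority": "1"}},
        {"id": "c", "lineage_start": 10, "attributes": {"priority": "2"}},
    ]
    ordered = order_queue(queue, [priority_attribute, oldest_flowfile_first])
    print([ff["id"] for ff in ordered])
```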

Adding Labels
▪ Labels are used to add annotations to the canvas
─ Useful for documenting parts of the dataflow
▪ When a label is dropped onto the canvas, it is created with a default size
▪ The label can be resized by dragging the handle in the bottom-right corner
▪ A label has no text when initially created
─ Add text by right-clicking on the label and choosing Configure

Hands-On Exercise: Build a More Complex Dataflow
▪ In this exercise, you will add processors to construct a more complex dataflow
▪ Please refer to the Hands-On Exercise Manual for instructions

Hands-On Exercise: Creating a Fork Using Relationships
▪ In this exercise, you will route FlowFiles through a dataflow based on
relationships
▪ Please refer to the Hands-On Exercise Manual for instructions

Hands-On Exercise: Set Back Pressure Thresholds
▪ In this exercise, you will add back pressure to connections in a dataflow
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ NiFi dataflows
─ NiFi uses Processors and Connections to create dataflows
─ Processors can be stateful or stateless
▪ Relationships
─ Every Processor in NiFi must produce at least one Relationship
─ Relationships determine where FlowFiles go next
▪ Back pressure
─ Back pressure is a tool to manage large queues in Connections
─ Connections can communicate with the upstream Processor to pause
sending files until the queue has become smaller than a pre-defined
threshold
─ Back pressure thresholds are user-defined
▪ Prioritizers
─ Prioritizers control which FlowFiles in a queue are processed first
─ Prioritizers are based on FlowFile time, arrival order, or importance of a
dataset
Process Groups
Chapter 6

Process Groups
After completing this chapter, you will be able to
▪ Use a Process Group to organize your Processors
▪ Interpret Process Group indicators
▪ Use input and output ports to move FlowFiles in and out of Process Groups

Chapter Topics

Process Groups
▪ Anatomy of Process Group
▪ Input and Output Ports
▪ Hands-On Exercise: Simplify Dataflows Using Process Groups
▪ Essential Points

Uses of a Process Group
▪ Organize and simplify the canvas
▪ Subdivide canvas by user groups, use case, and dataflows
▪ Manage access control at the Process Group level
▪ Allow version control using NiFi Registry
▪ Can encapsulate further processing using inbound and/or outbound
Connections

Elements of a Process Group
▪ Process Groups contain a set of Processors and their Connections
▪ Can start or stop all contained Processors with a single action
▪ Process Groups can be nested
─ A group can contain one or more subgroups
▪ Input and output ports pass FlowFiles into and out of a Process Group

Anatomy of a Process Group
▪ Process Groups provide a mechanism for grouping components together into a
logical construct
 
[Figure: Process Group with callouts for Process Group name, component counts, bulletin indicator, Active Tasks, 5-minute statistics, and Registry versioning status]

Name and Bulletin Indicator
▪ Name
─ User-defined name of the Process Group
─ Set when the Process Group is added to the canvas
─ In this example, the name of the Process Group is Save Log File
▪ Bulletin Indicator
─ Components in a Process Group propagate bulletins to the parent group
─ Bulletin indicator indicates when any component has an active bulletin
─ Hover pointer over the icon to see active bulletins
 
Component Counts
▪ How many components of each type exist within the Process Group
─ Number of running Processors and ports
─ Number of Processors and ports not currently running
─ Number of enabled Processors and ports in an invalid state
─ Number of disabled Processors and ports
 

5-Minute Statistics
▪ Process Group statistics include
─ Queued: number of FlowFiles currently enqueued within the Process Group
─ In: number of FlowFiles passed into the Process Group over the past five
minutes
─ Read/Write: total size of FlowFile content read from and written to disk in
the last five minutes
─ Out: number of FlowFiles that have been passed out over the past five
minutes
 

Active Tasks
▪ Active Tasks
─ The number of tasks that are currently being executed within this group
─ In this example the Process Group is currently performing two tasks
 
Adding and Configuring Input Ports
▪ Input ports provide a mechanism for passing FlowFiles into a Process Group
▪ All ports within a Process Group must have unique names
▪ In a secure NiFi system, you can configure ports to restrict access to
appropriate users
 

Adding and Configuring Output Ports
▪ Output ports provide a mechanism for passing FlowFiles out of a Process
Group
▪ All ports within a Process Group must have unique names
▪ In a secure NiFi system, you can configure ports to restrict access to appropriate users
 

Hands-On Exercise: Simplify Dataflows Using Process Groups
▪ In this exercise, you will simplify dataflows by moving individual dataflows
into process groups
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ Process Groups
─ Process Groups are a mechanism in Apache NiFi for organizing dataflows
─ A Process Group allows you to group a set of Processors on their own
embedded canvas
▪ Process Group surface panel
─ The Process Group surface panel summarizes the performance of the
Process Group
─ It includes elements like Process Group name, Active Tasks, Bulletin
indicator, component counts, 5-minute statistics, and registry versioning
status
▪ Input and Output Ports
─ NiFi uses input and output ports to move FlowFiles into and out of Process
Groups
─ Ports can be used to link Process Groups to a larger dataflow
─ Each port must have a unique name

FlowFile Provenance
Chapter 7
FlowFile Provenance
After completing this chapter, you will be able to
▪ Explain how the provenance event panel presents the lifecycle of FlowFiles
within the dataflow
▪ Interpret the FlowFile lineage
▪ Use the FlowFile replay to help you build a successful dataflow

Chapter Topics

FlowFile Provenance
▪ Data Provenance Events
▪ FlowFile Lineage
▪ Replaying a FlowFile
▪ Hands-On Exercise: Using Data Provenance
▪ Essential Points

FlowFile Provenance
▪ Data provenance provides a way to determine what happened to a particular
FlowFile
▪ NiFi keeps a fine-grained level of detail about each piece of data that it ingests
▪ Data provenance events are recorded in the provenance repository
▪ Use provenance to find, troubleshoot, and evaluate things like dataflow
compliance and optimization in real time
▪ By default, NiFi updates this information every five minutes (configurable)

How NiFi Handles Data Provenance
▪ Tracks FlowFiles as they flow through the system
▪ Records, indexes, displays, and visualizes each FlowFile's life through the dataflow
▪ Handles fan-in/fan-out (merging and splitting data)
▪ Displays attributes and content as they were when each event occurred
within a dataflow
 

Data Provenance Events
▪ Each point in a dataflow where a FlowFile is processed in some way is
considered a processing event
▪ Various types of processing events occur, depending on the dataflow design
▪ Key types include
─ RECEIVE—FlowFile is brought into the flow
─ SEND—FlowFile is sent out of the flow
─ CLONE—FlowFile is cloned
─ ROUTE—FlowFile is routed
─ CONTENT_MODIFIED or ATTRIBUTES_MODIFIED—content or attribute
of a FlowFile is changed
─ FORK or JOIN—a FlowFile is split or combined with other FlowFiles
─ DROP—FlowFile is removed from the flow
─ FETCH—an existing FlowFile’s contents are modified as a result of obtaining
data from an external resource
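
One way to picture the event types above is as a log of typed records, one per processing event. The following is a minimal Python sketch; the `ProvenanceEvent` record, the `history` helper, the UUIDs, and the choice of component names are illustrative assumptions, not NiFi's internal data model.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    """The key event types listed above."""
    RECEIVE = "RECEIVE"
    SEND = "SEND"
    CLONE = "CLONE"
    ROUTE = "ROUTE"
    CONTENT_MODIFIED = "CONTENT_MODIFIED"
    ATTRIBUTES_MODIFIED = "ATTRIBUTES_MODIFIED"
    FORK = "FORK"
    JOIN = "JOIN"
    DROP = "DROP"
    FETCH = "FETCH"


@dataclass(frozen=True)
class ProvenanceEvent:
    """Hypothetical record: which FlowFile, what happened, and where."""
    flowfile_uuid: str
    event_type: EventType
    component: str


def history(events, flowfile_uuid):
    """Return the event type names recorded for one FlowFile, in log order."""
    return [e.event_type.name for e in events if e.flowfile_uuid == flowfile_uuid]


log = [
    ProvenanceEvent("ff-1", EventType.RECEIVE, "GetFile"),
    ProvenanceEvent("ff-2", EventType.RECEIVE, "GetFile"),
    ProvenanceEvent("ff-1", EventType.ATTRIBUTES_MODIFIED, "UpdateAttribute"),
    ProvenanceEvent("ff-1", EventType.SEND, "PutSFTP"),
    ProvenanceEvent("ff-1", EventType.DROP, "PutSFTP"),
]
print(history(log, "ff-1"))
```

Reading the list for one FlowFile from first event to last gives its lifecycle, from entering the flow to being dropped.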

Global Menu: Data Provenance
[Figure: Data Provenance page, with callouts for Event Type, FlowFile Lineage Graph, Provenance Event Details, and Go-To Arrow]

Provenance Event Panel
▪ In the far-left column of the Data Provenance page, there is a “View Details”
icon for each event
▪ Clicking this icon opens a dialog window with three tabs: DETAILS,
ATTRIBUTES, and CONTENT
 
[Figure: View Details dialog, with callouts for the Details, Attributes, and Content tabs and for Download/View content]

Searching for Events

▪ To search for a FlowFile, click the Search icon in the Data Provenance page
▪ Enter the parameters to define the search
▪ For example, to determine if a particular FlowFile was received, search for
─ Event Type of RECEIVE
─ FlowFile with "ABC" anywhere in its filename
─ Received at any time on July 28, 2016
▪ Use the asterisk (*) as a wildcard
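
The example search can be mimicked over a list of event records. The following is a minimal Python sketch of the three criteria; the event dictionaries and the `search` helper are hypothetical, and the standard library's `fnmatchcase` stands in for NiFi's wildcard matching (it also gives `?` and `[]` special meaning, which NiFi's search may not).

```python
from datetime import date, datetime
from fnmatch import fnmatchcase

# Hypothetical event records; real provenance events carry many more fields.
events = [
    {"type": "RECEIVE", "filename": "logABC.txt", "time": datetime(2016, 7, 28, 9, 15)},
    {"type": "SEND", "filename": "logABC.txt", "time": datetime(2016, 7, 28, 9, 16)},
    {"type": "RECEIVE", "filename": "ABC-report.csv", "time": datetime(2016, 7, 29, 0, 5)},
    {"type": "RECEIVE", "filename": "other.txt", "time": datetime(2016, 7, 28, 23, 59)},
]


def search(events, event_type, filename_pattern, day):
    """Keep events that match the type, the filename wildcard, and the calendar day."""
    return [
        e for e in events
        if e["type"] == event_type
        and fnmatchcase(e["filename"], filename_pattern)
        and e["time"].date() == day
    ]


# "*ABC*" matches "ABC" anywhere in the filename.
matches = search(events, "RECEIVE", "*ABC*", date(2016, 7, 28))
print([m["filename"] for m in matches])
```

Only the first record satisfies all three criteria; the others fail on event type, date, or filename.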

Viewing FlowFile Lineage
▪ The Data Provenance page lets you view the lineage or path a FlowFile took
─ Use the “Show Lineage” icon to open a graph showing the processing events
▪ The selected event will be highlighted in red
▪ Hover pointer over the FlowFile icon to highlight the entire lineage path
 
[Figure: lineage graph, with callouts for Pop out, the FlowFile, the event whose graph was selected, Return to Event List, and the slider that shows the evolution of the lineage]

FlowFile Lineage: Find Parents
▪ When a FORK or CLONE event occurs, NiFi keeps track of the parent FlowFile
that produced other FlowFiles
▪ Use the lineage graph to find a parent FlowFile
▪ Right-click on the event and select Find parents
▪ Displays up to 100 child FlowFiles
─ Use View Details to display UUID of all child FlowFiles
 
[Figure: clicking "Find parents" re-draws the graph with the parent FlowFile and its lineage; callouts mark the Parent and the Child]

FlowFile Lineage: Expanding an Event
▪ You can also determine what children were spawned from a given FlowFile
▪ Right-click on the event in the lineage graph and select Expand from the
context menu
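
Finding parents and expanding an event are two directions of travel over the same parent-to-child relationships. The following is a minimal Python sketch, assuming a hypothetical `CHILDREN` mapping built from FORK and CLONE events; the UUIDs and function names are illustrative only.

```python
# Hypothetical record of FORK/CLONE events: parent UUID -> child UUIDs.
CHILDREN = {
    "parent-1": ["child-a", "child-b"],
    "child-a": ["grandchild-x"],
}


def find_parents(uuid):
    """Return the FlowFiles that produced the given FlowFile."""
    return [parent for parent, kids in CHILDREN.items() if uuid in kids]


def expand(uuid):
    """Return the FlowFiles spawned directly from the given FlowFile."""
    return list(CHILDREN.get(uuid, []))


def descendants(uuid):
    """Expand repeatedly to collect children, grandchildren, and so on."""
    found = []
    for child in expand(uuid):
        found.append(child)
        found.extend(descendants(child))
    return found


print(find_parents("child-a"))
print(expand("parent-1"))
print(descendants("parent-1"))
```

Find parents walks one step up from a child; Expand walks one step down from a parent, and repeating it uncovers the rest of the lineage.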

[Figure: clicking "Expand" re-draws the graph to show the children and their lineage]

Replaying a FlowFile
▪ You can inspect a FlowFile’s content as it was when an event occurred
▪ You can make adjustments to the dataflow and replay the FlowFile again
─ Go to the Content tab of the View Details dialog window
▪ Click REPLAY to replay the FlowFile at this point in the flow
─ The FlowFile is sent to the Connection feeding the component that produced
this processing event
▪ A FlowFile can only be replayed if its content is still available in the content
repository
 

Hands-On Exercise: Using Data Provenance
▪ In this exercise, you will view FlowFile provenance
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ Data provenance is a historical record of how FlowFiles have passed through
your dataflow that can be used for debugging and optimization
▪ FlowFile lineage is a graphical representation of all the events that have been
triggered by a FlowFile passing through your dataflow
▪ You can use the FlowFile replay feature to re-run FlowFiles at specific events
to help understand your flow and test bug fixes

Dataflow Templates
Chapter 8
Dataflow Templates
After completing this chapter, you will be able to
▪ Describe the situations in which templates should be used
▪ Create a template in your Apache NiFi canvas
▪ Manage templates to export and import a template

Chapter Topics

Dataflow Templates
▪ Templates Overview
▪ Managing Templates
▪ Hands-On Exercise: Creating, Using, and Managing Templates
▪ Essential Points

Templates Overview
▪ Processors, Connections, Funnels, and so on are the basic building blocks for
constructing a dataflow
▪ Using small building blocks can become tedious if the same logic gets repeated
several times
▪ A template is a way of combining basic building blocks into larger building
blocks
▪ Once a dataflow has been created, parts of it can be selected and saved as a
template

Templates Overview (2)
▪ A template can be downloaded as an XML file to be shared with others
─ The template can be uploaded into a separate instance of NiFi
▪ Use templates to
─ Reuse a useful portion of a dataflow in other dataflows
─ Transfer dataflows from one environment to another

Creating a Template
▪ Select the components to include in the template
▪ Click the “Create Template” icon in the Operate palette
▪ Provide a name and optionally comments about the template
 

Uploading a Template
▪ Upload a template from the canvas Operate palette
▪ Uploaded templates are available for selection when you drag the template
icon from the toolbar
 
 

Adding a Template to the Canvas
▪ Add a template to the canvas by dragging and dropping the Template icon
from the toolbar
 
[Figure: dragging the Template icon onto the canvas, then choosing a template]

Managing Templates
▪ One of the most powerful features of templates is the ability to easily export
templates
─ Saved as XML files
─ Provides a simple mechanism for sharing parts of a dataflow with others
▪ You can
─ Upload a template
─ Download a template
─ Remove a template

Downloading or Removing a Template
▪ Use the global menu Templates option to show a list of existing templates
▪ Provides icons to
─ Save a template as an XML file
─ Delete a template
 

Hands-On Exercise: Creating, Using, and Managing Templates
▪ In this exercise, you will create, use, and manage templates
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ Templates in NiFi are used to combine basic building blocks into larger
building blocks that can be reused over and over
▪ Templates can be used for sharing dataflows between different instances of
NiFi
▪ To create a template, select the components you want to reuse and click the
“Create Template” icon in the Operate palette
▪ To import a template, click the “Upload template” icon in the Operate palette
▪ Use the Templates option in the NiFi global menu to export and delete
templates

Apache NiFi Registry
Chapter 9
Apache NiFi Registry
After completing this chapter, you will be able to
▪ Describe the benefits of using Apache NiFi Registry
▪ Use a registry to track versions of a Process Group

Chapter Topics

Apache NiFi Registry


▪ Apache NiFi Registry Overview
▪ Using the Registry
▪ Hands-On Exercise: Versioning Flows Using NiFi Registry
▪ Essential Points

Apache NiFi Registry
▪ Separate, complementary application to Apache NiFi
─ Part of Cloudera Flow Management (CFM)
▪ Centralized repository for versioning NiFi flows
─ Supports multiple registries and interactions between them
─ Can be integrated with enterprise version control such as Git
▪ Manages the flow development lifecycle with features including
─ Notifications
─ Release management
▪ Enables collaboration by sharing registry repositories between multiple
developers
 

Key NiFi Registry Concepts
▪ Flow—a dataflow within a Process Group
─ Versioning can only be applied to Process Groups
▪ Registry Client—connects a NiFi instance to a NiFi Registry instance
─ One NiFi instance can have any number of registry clients
─ Multiple NiFi instances can connect to the same registry, which allows
developers to work on the same flows
▪ Buckets—store and organize flow versions
─ Each flow is assigned to one bucket
─ One bucket holds any number of related flows
─ Can provide security through user access control

Flow Versioning
▪ Flows are versioned at the Process Group level
▪ Make changes to the flow in a versioned Process Group locally
▪ Use the NiFi UI to
─ Commit local changes to the repository, which creates a new version
─ Review differences between the local version and the last committed version
─ Revert local changes and restore flow to the last version
─ Roll back the flow to an earlier version previously committed to the
repository
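
The difference between committing, reverting, and rolling back is easier to see with a toy model. The following is a minimal Python sketch; the `VersionedFlow` class and its method names are hypothetical, and it simplifies NiFi Registry's real behavior (for example, it does not model a remote registry or buckets).

```python
import copy


class VersionedFlow:
    """Illustrative model of a versioned Process Group (not NiFi's API)."""

    def __init__(self, initial):
        self.versions = [copy.deepcopy(initial)]  # index 0 holds version 1
        self.local = copy.deepcopy(initial)       # the flow as edited on the canvas
        self.current = 1                          # version the local flow is based on

    def has_local_changes(self):
        """Compare the local flow with the version it is based on."""
        return self.local != self.versions[self.current - 1]

    def commit(self):
        """Save the local flow as a new version and return its number."""
        self.versions.append(copy.deepcopy(self.local))
        self.current = len(self.versions)
        return self.current

    def revert(self):
        """Discard local changes and restore the version the flow is based on."""
        self.local = copy.deepcopy(self.versions[self.current - 1])

    def change_version(self, number):
        """Switch to a previously committed version (a rollback if it is older)."""
        if not 1 <= number <= len(self.versions):
            raise ValueError(f"no such version: {number}")
        self.local = copy.deepcopy(self.versions[number - 1])
        self.current = number


flow = VersionedFlow({"processors": ["GetFile"]})
flow.local["processors"].append("PutFile")
changed = flow.has_local_changes()
v2 = flow.commit()
flow.local["processors"].append("LogAttribute")
flow.revert()
after_revert = copy.deepcopy(flow.local)
flow.change_version(1)
print(changed, v2, after_revert, flow.local)
```

Revert only discards uncommitted edits; changing version replaces the local flow with an earlier committed snapshot.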

Add a Registry Client
▪ Before you can start versioning your flows, you must add one or more registry
clients to your NiFi instance
▪ Use the Registry Clients tab under Global Menu > Controller Settings
▪ Add the URL for the NiFi Registry instance you want to use
 
 

NiFi Registry UI: Managing Buckets
▪ Create, edit, and delete buckets on the NiFi Registry UI Administration page
 

The NiFi Registry UI: Viewing Buckets
▪ The main NiFi Registry UI page lets you search for buckets
▪ Open a bucket to see a list of versioned flows in that bucket
▪ Open a flow to show and manage versions
 
[Figure: NiFi Registry UI main page, with callouts for the bucket selector, flow version details, and Administration Settings]

Adding Versioning to a Flow (1)
▪ Flows do not use NiFi Registry by default
▪ Add versioning to a flow in the main NiFi UI
─ Use the Process Group’s context menu Start Version Control option
─ Note that flows must be in a Process Group to be versioned
 
Adding Versioning to a Flow (2)
▪ When prompted, configure the flow’s versioning with
─ Registry client
─ Bucket
─ Flow name and description
─ Comments for the initial version
 

Working with Versioned Flows (1)
▪ Each versioned Process Group shows one of the following states
─ The versioned flow is up to date in the repository
─ You have modified a flow but have not committed your changes to the
repository
─ The registry contains a newer version of this Process Group
─ Failed to synchronize with the registry
─ The registry contains a newer version, but local changes exist which have
not been added to the repository

Working with Versioned Flows (2)
▪ Use the Process Group’s context menu to revert or commit local changes,
review changes, or roll back to a prior version
 

Hands-On Exercise: Versioning Flows Using NiFi Registry
▪ In this exercise, you will add versioning to a process group and explore
versioning options
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ Apache NiFi Registry is a centralized repository for versioning NiFi dataflows
▪ It is useful for collaborating on and sharing dataflows and provides a
mechanism for tracking different versions of dataflows
▪ You must connect a NiFi Registry instance to your NiFi instance before
beginning versioning
▪ To add a dataflow to NiFi Registry, it must be in a Process Group
▪ To add a Process Group to a connected NiFi Registry, right-click on the Process
Group and select Version followed by Start version control

FlowFile Attributes
Chapter 10
Chapter Topics

FlowFile Attributes
▪ FlowFile Attribute Overview
▪ Routing on Attributes
▪ Hands-On Exercise: Working with FlowFile Attributes
▪ Essential Points

FlowFile Attributes
After completing this chapter, you will be able to
▪ Describe several uses for FlowFile attributes
▪ Modify an attribute using the Update Attribute Processor
▪ Demonstrate how FlowFiles can be routed depending on their attributes

What are FlowFile Attributes?

▪ Each FlowFile is created with several attributes
─ Values can change as the FlowFile moves through a flow
▪ All FlowFiles have a default set of attributes
─ NiFi assigns values automatically
▪ You can define and set new attributes
─ Attributes and values can be set and changed within the flow
[Figure: a FlowFile, made up of Attributes (filename, path, UUID, ...) and Content (*claim)]

FlowFile Attribute Uses

▪ Routing
▪ Processing
▪ Storing information about data
▪ Sending values to subsequent processors

Default Attributes (1)
 

▪ filename
▪ path
▪ UUID
▪ lineageStartDate
▪ entryDate
▪ fileSize

Default Attributes (2)
▪ All default attributes are automatically assigned initial values
─ filename and path values can be overridden manually within the flow
─ Other default attributes are set by the system and cannot be changed by the
user

User-defined Attributes
▪ You can add user-defined attributes to FlowFiles using the
UpdateAttribute Processor
─ Add a “dynamic” configuration property to the Processor
─ An attribute will be added to the FlowFile with the same name as the
Processor property name
▪ A few other Processors also allow using dynamic properties to set attributes
─ Such as GenerateFlowFile for testing
 

Setting Attribute Values
▪ Set new dynamic properties in the Properties tab
─ The same as setting any other property
▪ Properties can be set to a literal value or to the result of a NiFi Expression
Language statement
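
The effect of dynamic properties on a FlowFile's attributes can be sketched as a dictionary update. The following is a minimal Python sketch, not NiFi code: `update_attributes` is a hypothetical helper, the attribute name `source.system` is made up, and a Python callable stands in for an Expression Language statement such as `${filename:toUpper()}`.

```python
def update_attributes(attributes, dynamic_properties):
    """Return a new attribute map with each dynamic property applied.

    A property value is either a literal string or a callable that stands in
    for an Expression Language statement evaluated against the FlowFile.
    """
    updated = dict(attributes)
    for name, value in dynamic_properties.items():
        updated[name] = value(attributes) if callable(value) else value
    return updated


flowfile_attributes = {"filename": "weblog.txt", "path": "./"}

properties = {
    # Literal value: adds a new, user-defined attribute.
    "source.system": "web-01",
    # Stand-in for ${filename:toUpper()}: overrides an existing attribute.
    "filename": lambda attrs: attrs["filename"].upper(),
}

result = update_attributes(flowfile_attributes, properties)
print(result)
```

Each property name becomes an attribute name, so a property called `filename` overrides the default attribute while `source.system` adds a new one.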
 

Viewing Attributes
▪ View attributes in the ATTRIBUTES tab of a FlowFile's details
─ Open the FlowFile from Data Provenance or from a Connection queue
▪ View all attributes or only changed ones
 

Extracting Attributes
▪ NiFi provides several different out-of-the-box Processors for extracting
attributes from FlowFiles
▪ This is also a very common use case for building custom Processors, such as
ones that
─ Understand a specific data format
─ Extract pertinent information from a FlowFile’s content
─ Create attributes to hold that information
▪ Decisions can then be made about how to route or process the data

Routing on Attributes
▪ NiFi can route a FlowFile to different Processors based on the FlowFile’s
attributes
▪ The UpdateAttribute Processor adds attributes, and the
RouteOnAttribute Processor routes FlowFiles based on them
▪ RouteOnAttribute evaluates each configured property, which holds a NiFi
Expression Language expression that returns a boolean
─ The configured routing strategy determines how the results are used
▪ The most common is the “Route to Property name” strategy
─ The Processor will expose a relationship for each property configured
─ If the attributes satisfy the expression, a child copy of the FlowFile will be
routed to the corresponding relationship
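
The “Route to Property name” strategy can be sketched as one boolean test per configured property. The following is a minimal Python sketch under stated assumptions: `route_on_attribute` is a hypothetical helper, the property names `is_csv` and `large` are made up, and Python lambdas stand in for Expression Language expressions such as `${filename:endsWith('.csv')}` and `${fileSize:gt(1024)}`.

```python
def route_on_attribute(attributes, routes):
    """Return the relationships a FlowFile would be routed to.

    Each entry in `routes` maps a property (relationship) name to a predicate
    that stands in for a boolean Expression Language expression.
    """
    matched = [name for name, predicate in routes.items() if predicate(attributes)]
    # A FlowFile that satisfies no expression goes to the "unmatched" relationship.
    return matched or ["unmatched"]


routes = {
    "is_csv": lambda a: a.get("filename", "").endswith(".csv"),
    "large": lambda a: int(a.get("fileSize", 0)) > 1024,
}

big_csv = route_on_attribute({"filename": "a.csv", "fileSize": "2048"}, routes)
small_txt = route_on_attribute({"filename": "b.txt", "fileSize": "10"}, routes)
print(big_csv, small_txt)
```

A FlowFile that satisfies several expressions appears under several relationships, which corresponds to a copy being routed to each one.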

Hands-On Exercise: Working with FlowFile Attributes
▪ In this exercise, you will practice working with FlowFile attributes
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ FlowFiles are composed of two sections: content (the data) and attributes
▪ Attributes are FlowFile metadata, made of key-value pairs
▪ FlowFile attributes can be used for routing, processing, storing information,
and sending values to subsequent Processors
▪ NiFi uses a set of default attributes for each FlowFile. You can add custom
attributes using the UpdateAttribute Processor
▪ You can use the RouteOnAttribute Processor to route a FlowFile based on its
attributes

NiFi Expression Language
Chapter 11
Chapter Topics

NiFi Expression Language


▪ NiFi Expression Language Overview
▪ Syntax
▪ Expression Language Editor
▪ Setting Conditional Values
▪ Hands-On Exercise: Using the NiFi Expression Language
▪ Essential Points

NiFi Expression Language
After completing this chapter, you will be able to
▪ Use the language editor to create a simple expression that changes a FlowFile
attribute
▪ Use the Apache NiFi expression language to change a FlowFile attribute based
on certain conditions and rules

NiFi Expression Language
▪ NiFi Expression Language (NEL) is a flexible, consistent mechanism for
manipulating FlowFile attributes
▪ Provides access to system environment variables and JVM properties
▪ Can be used to specify supported Processor property values

Hover over the information icon in the Processor configuration
Properties tab to see details about using NEL for that property
 

NiFi Expression Language Functions
▪ The NiFi Expression Language provides numerous useful functions to
manipulate attribute values
▪ A few examples
─ Boolean logic: equals, gt (greater than), lt (less than)
─ String manipulation: substring, toUpper, replace
─ Numeric functions: plus, minus, multiply
▪ The full list of functions is documented in the Apache NiFi Expression
Language Guide

Using Expression Language Statements
▪ Expressions are enclosed between open and close curly braces, prefixed by a
dollar sign ${...}
▪ The expression language reserves several “special” characters
─ Tabs, newlines, and spaces
─ ( ) $ | { } [ ] , * ; / : " '
─ To reference an attribute whose name contains a special character, quote
the name, for example ${"my attribute"}

Using FlowFile Attributes in Expressions
▪ Expressions are denoted by ${...}
▪ Attribute references will return the attribute’s value
▪ Example: Return the value of the FlowFile’s filename attribute
─ Expression: ${filename}
─ Subject attribute: filename = my-file.tgz
─ Result: my-file.tgz
 

Setting an Attribute to an Expression Return Value
▪ Assign an attribute value by setting a property on an UpdateAttribute
Processor
▪ Example: Store the original filename in a new property called Original
Filename
 

Using Attributes with Text Strings (1)
▪ Expressions can be used to generate new strings by concatenating their
results with literal text
▪ Example: Return the existing filename attribute value with the text
.packaged appended
─ Expression: ${filename}.packaged
─ Subject attribute: filename = my-file.tgz
─ Result: my-file.tgz.packaged
 

Using Attributes with Text Strings (2)
▪ Example (continued): Use the expression to set a new filename attribute
based on the existing one
─ Add filename property to an UpdateAttribute Processor
 

Expression Language Function Calls (1)
▪ Example: Return the existing filename attribute, replacing the string .tgz
with string .tar.gz
─ Expression: ${filename:replace('.tgz', '.tar.gz')}
─ Subject attribute: filename = my-file.tgz
─ Result: my-file.tar.gz
 

Expression Language Function Calls (2)
▪ Example (continued): Use the expression to set a new filename attribute
based on the existing one
─ Add filename property to UpdateAttribute Processor
 

Combining Multiple Expressions (1)
▪ A single property setting can embed multiple expressions in a string
▪ Example: Include the UUID of a FlowFile in the filename
─ Expression 1:
─ Subject attribute: filename = system.log, result: system
─ Expression 2:
─ Subject attribute: uuid = abc123, result: abc123
─ Final result: system_abc123.log
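The two expressions on this slide were not preserved in the text. One plausible reading, offered here as an assumption, is a property value such as `${filename:substringBeforeLast('.')}_${uuid}.log`. The Python sketch below models how two expression results and literal text combine into one string; the helper function is an illustrative stand-in, not NiFi code.

```python
def substring_before_last(subject: str, marker: str) -> str:
    """Mimic substringBeforeLast: text before the final occurrence of marker.

    Returns the whole subject when the marker is absent.
    """
    index = subject.rfind(marker)
    return subject if index == -1 else subject[:index]


def build_filename(attributes: dict) -> str:
    # Expression 1 (assumed form): ${filename:substringBeforeLast('.')}
    base = substring_before_last(attributes["filename"], ".")
    # Expression 2: ${uuid}
    unique_id = attributes["uuid"]
    # The literal text "_" and ".log" surrounds the two expression results.
    return f"{base}_{unique_id}.log"


new_name = build_filename({"filename": "system.log", "uuid": "abc123"})
print(new_name)
```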
 

Combining Multiple Expressions (2)
▪ Example (continued): Include the UUID of a FlowFile in the filename
─ Add filename property to UpdateAttribute
 

Embedding Expressions in Function Calls (1)
▪ Expression results can be passed as arguments to functions
▪ Example: Test if the FlowFile has changed so far in the flow
─ Compare lineageStartDate to entryDate—if they match, the
current FlowFile is unchanged
─ Outer expression subject attribute: lineageStartDate =
1567000319621
─ Inner expression subject attribute: entryDate = 1567000319621
─ Result: true (boolean)
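The expression itself is missing from the extracted slide. It presumably has the form `${lineageStartDate:equals(${entryDate})}`, which is an assumption here. The Python sketch below shows the evaluation order that such nesting implies: the inner expression is resolved first, and its result is passed as the function argument.

```python
def is_unchanged(attributes: dict) -> bool:
    # Step 1: resolve the inner expression, ${entryDate}.
    inner_value = attributes["entryDate"]
    # Step 2: pass that result to equals(), applied to the outer subject.
    return attributes["lineageStartDate"] == inner_value


same = is_unchanged({"lineageStartDate": "1567000319621", "entryDate": "1567000319621"})
different = is_unchanged({"lineageStartDate": "1567000319621", "entryDate": "1567000400000"})
print(same, different)
```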
 

Embedding Expressions in Function Calls (2)
▪ Boolean expressions can be used for routing
─ Add routing property to RouteOnAttribute Processor
 

Chaining Functions (1)
▪ Multiple functions can be used in the same expression
─ Separated by a colon (:)
▪ Expression language statements are generally processed left to right
─ Functions manipulate the value returned by the previous function
▪ Example: Extract a date string embedded in a filename
─ Subject attribute: filename = system-20190823.log
─ substringBeforeLast(".log") returns system-20190823
─ substringAfterLast("-") returns 20190823

Chaining Functions (2)
▪ Example (continued): Set a new attribute called logDate to the extracted
date string embedded in a filename
─ Add logDate property to an UpdateAttribute Processor
 

Chaining Functions (3)
▪ Example (continued): Extract day, month, and year from extracted date and
set multiple attributes
─ Define and set three new properties on an UpdateAttribute Processor
─ logYear = ${filename:substringBeforeLast(".log"):
substringAfterLast("-"):substring("0","4")}
─ logMonth = ${filename:substringBeforeLast(".log"):
substringAfterLast("-"):substring("4","6")}
─ logDay = ${filename:substringBeforeLast(".log"):
substringAfterLast("-"):substring("6","8")}
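To check what the three chained expressions produce, the same left-to-right steps can be traced in Python. The helper functions below are illustrative stand-ins that mimic the named functions; they are not NiFi code.

```python
def substring_before_last(subject: str, marker: str) -> str:
    """Text before the final occurrence of marker (whole subject if absent)."""
    index = subject.rfind(marker)
    return subject if index == -1 else subject[:index]


def substring_after_last(subject: str, marker: str) -> str:
    """Text after the final occurrence of marker (whole subject if absent)."""
    index = subject.rfind(marker)
    return subject if index == -1 else subject[index + len(marker):]


def substring(subject: str, start: int, end: int) -> str:
    """Characters from start (inclusive) to end (exclusive)."""
    return subject[start:end]


filename = "system-20190823.log"

# Each call consumes the value returned by the previous one, as in the chain.
log_date = substring_after_last(substring_before_last(filename, ".log"), "-")
log_year = substring(log_date, 0, 4)
log_month = substring(log_date, 4, 6)
log_day = substring(log_date, 6, 8)

print(log_date, log_year, log_month, log_day)
```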

Expression Language Syntax Summary
▪ Expressions are denoted by open and close curly braces prefixed by a dollar
sign ${...}
▪ In its most basic form, an expression can consist of just a FlowFile attribute
name
─ For example, ${filename} returns the value of the filename attribute
▪ Use : to call functions to manipulate attribute values
─ For example, ${filename:toUpper()} returns the value of the
filename attribute converted to upper case
▪ Any number of functions can be chained using :
─ Chained functions manipulate the value returned by the previous function in
the chain
─ For example, ${filename:toUpper():equals('HELLO.TXT')}
returns true if the file name is HELLO.TXT
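The syntax rules above can be made concrete with a toy evaluator. The Python sketch below handles only the `${attribute:function(args):...}` shape and a handful of functions. It is a teaching aid under stated simplifications, not the NiFi parser: it ignores nesting, quoting rules, and arguments that contain commas, and it treats a missing attribute as an empty string.

```python
import re
from typing import Callable, Dict

# A tiny subset of functions, keyed by their expression language names.
FUNCTIONS: Dict[str, Callable[..., object]] = {
    "toUpper": lambda subject: subject.upper(),
    "toLower": lambda subject: subject.lower(),
    "equals": lambda subject, other: subject == other,
    "replace": lambda subject, search, replacement: subject.replace(search, replacement),
}

EXPRESSION = re.compile(r"\$\{(\w+)((?::\w+\([^()]*\))*)\}")
CALL = re.compile(r":(\w+)\(([^()]*)\)")


def evaluate(expression: str, attributes: Dict[str, str]) -> object:
    """Evaluate a single ${...} expression against a dict of attributes."""
    match = EXPRESSION.fullmatch(expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression}")
    # The subject is the attribute's value (empty string if missing: a simplification).
    value: object = attributes.get(match.group(1), "")
    # Apply each chained function, left to right, to the previous result.
    for name, raw_args in CALL.findall(match.group(2)):
        args = [a.strip().strip("'\"") for a in raw_args.split(",")] if raw_args.strip() else []
        value = FUNCTIONS[name](value, *args)
    return value


attrs = {"filename": "hello.txt"}
print(evaluate("${filename}", attrs))
print(evaluate("${filename:toUpper()}", attrs))
print(evaluate("${filename:toUpper():equals('HELLO.TXT')}", attrs))
```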

Expression Language Editor Features (1)
▪ Structure highlighting: when an open curly bracket, square bracket, or
parenthesis is selected, the corresponding closing bracket or parenthesis is
highlighted
▪ Syntax color coding: different elements of the statement are color coded
(subjects, functions, function inputs, and so on)
▪ If invalid syntax is detected, the color coding is removed

Expression Language Editor Features (2)
▪ Multi-line support
─ A statement can be written across multiple lines for readability
─ Press Shift+Enter to start a new line
▪ Comments
─ Start comments by using a hash symbol (#) anywhere on a line
─ Comments end at the end of the line
 

Expression Language Editor Features (3)
▪ Auto-complete
─ Press Ctrl+Spacebar to see a list of applicable functions
─ Select an item in the list to see a pop-up with documentation for the
function
 

Setting Values Conditionally
▪ NiFi Expression Language has limited support for if/then statements
▪ Specify condition rules on an UpdateAttribute Processor for more robust
conditional handling
▪ Use the ADVANCED button on the UpdateAttribute PROPERTIES
configuration tab
 

Rules-based Attribute Settings (1)
▪ Each rule has one or more condition expressions and corresponding actions
─ Condition expressions must evaluate to boolean values (true or false)
─ If all conditions are true, the rule’s action(s) are applied
─ Actions set a specified attribute’s value
 
[Figure callouts: Select rule; Conditional expressions; Action attributes and values]

Rules-based Attribute Settings (2)
▪ You can create as many rules as you want
─ Rules are evaluated in the order they are listed, top to bottom
─ Drag and drop to change order
▪ If there are no rules to set an attribute, NiFi will use the corresponding
Processor property

FlowFile Policies
▪ Choose one of two FlowFile policies for how to set FlowFile attributes
─ Use clone (default)—if more than one rule’s conditions are true for a
FlowFile, a new FlowFile (clone) will be created for each rule
─ That is, if three rules match, three new FlowFiles will be created with the
different rules applied
─ Use original: all matching rules are applied to the original FlowFile
─ Rules are applied in order, top to bottom
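The difference between the two policies can be sketched in Python. This is a simplified model of the behavior described above, not the UpdateAttribute implementation. The rule names, conditions, and attribute names are hypothetical, and conditions are evaluated against the incoming attributes only.

```python
from typing import Callable, Dict, List, NamedTuple

Attributes = Dict[str, str]


class Rule(NamedTuple):
    name: str
    condition: Callable[[Attributes], bool]  # stands in for the condition expressions
    actions: Attributes                      # attribute names and values to set


def apply_rules(attributes: Attributes, rules: List[Rule], policy: str) -> List[Attributes]:
    """Return the attribute sets of the FlowFiles that leave the Processor."""
    matching = [rule for rule in rules if rule.condition(attributes)]
    if policy == "use original":
        # One FlowFile: every matching rule is applied in order, top to bottom.
        result = dict(attributes)
        for rule in matching:
            result.update(rule.actions)
        return [result]
    if policy == "use clone":
        # One FlowFile per matching rule, each with only that rule applied.
        if not matching:
            return [dict(attributes)]
        return [{**attributes, **rule.actions} for rule in matching]
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical rules.
rules = [
    Rule("small file", lambda a: int(a["fileSize"]) < 1000, {"size.class": "small"}),
    Rule("log file", lambda a: a["filename"].endswith(".log"), {"kind": "log"}),
]
incoming = {"filename": "system.log", "fileSize": "512"}

clones = apply_rules(incoming, rules, "use clone")
original = apply_rules(incoming, rules, "use original")
print(len(clones), len(original))
```

With two matching rules, the clone policy yields two FlowFiles that each carry one rule's attributes, while the original policy yields a single FlowFile carrying both.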

Hands-On Exercise: Using the NiFi Expression Language
▪ In this exercise, you will configure a processor using the NiFi Expression
Language to set FlowFile attributes
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ NiFi Expression Language (NEL) is a tool for dynamically manipulating or
changing FlowFile attributes in your dataflow
▪ NEL is only available in selected Processors
─ The most common is UpdateAttribute
▪ NEL has a large number of functions including boolean operators, string
manipulation, and numeric functions
▪ The Expression Language Editor helps you to create expressions with features
such as structure highlighting, syntax color coding, multi-line support,
comments, and auto-complete
▪ Using the ADVANCED page, you can create more complex conditional
expressions

NiFi Architecture
Chapter 12
Chapter Topics

NiFi Architecture
▪ NiFi Architecture Overview
▪ Cluster Architecture
▪ Heartbeats
▪ Managing Clusters
▪ Essential Points

NiFi Architecture
After completing this chapter, you will be able to
▪ Describe the architecture on which Apache NiFi is deployed
▪ Use the NiFi interface to manage clusters
▪ Describe how NiFi uses heartbeats as a health status

NiFi Architecture Overview
▪ NiFi can run on a single node—standalone mode—or multiple nodes—called a
cluster
▪ Individual nodes have the same basic architecture in both modes
 

Primary Components in the JVM
▪ NiFi executes within a JVM running on a host
▪ The primary components of NiFi running in the JVM are
─ Web Server—hosts NiFi’s HTTP-based command and control API
─ Flow Controller—manages threads and schedules execution and resources
─ The “brains” of the operation
─ Extensions such as custom Processors and NiFi plugins

Primary Storage Components
▪ FlowFile Repository—where NiFi keeps track of the state of active FlowFiles
─ The default approach is a persistent Write-Ahead Log on a specified disk
partition
▪ Content Repository—where the actual contents of FlowFiles are stored
─ The default approach stores blocks of data in a file system, supporting
multiple locations on different physical volumes for performance
▪ Provenance Repository—where provenance event data is stored
─ By default, located on one or more physical disk volumes
─ Event data is indexed and searchable
▪ flow.xml.gz—contains information about everything on the canvas

Why Use NiFi in a Cluster?
▪ Physical resource exhaustion can occur even with an optimized dataflow
─ One instance of NiFi on a single server might not be enough to process all
required data
▪ Installing NiFi in a cluster solves this problem
─ Spreads the data load across multiple NiFi instances
▪ NiFi provides a single interface to
─ Make dataflow changes and replicate them throughout the cluster
─ Monitor all dataflows running across the cluster

NiFi Clustering Architecture Overview
▪ A cluster is a set of “nodes”—separate NiFi instances working together to
process data
▪ Each node in the cluster performs the same tasks on a different dataset
 

Cluster Nodes
▪ The same dataflows run on all the nodes
─ By default, components in the flows run on every node
─ Processors can be configured to run on primary node only
▪ You can access the UI for any of the individual nodes in the cluster
▪ If a node is disconnected from the cluster, you cannot make changes to any of
the flows

Cluster Coordinator
▪ One of the nodes in the cluster is identified as the cluster coordinator
─ Every cluster has exactly one coordinator
─ Automatically elected using Apache ZooKeeper
▪ You can change flows using the UI on any node in the cluster
▪ The coordinator automatically propagates flow changes on one node to all
other nodes
▪ The coordinator also manages which nodes are allowed in the cluster

Primary Node and Isolated Processors
▪ Every cluster includes one primary node
─ Elected using ZooKeeper
─ Note: coordinator and primary nodes serve different functions
▪ Most Processors run on all nodes by default
─ But some Processors should run on a single node—an isolated Processor
─ Often required when communicating with external systems that do not scale
well
 

[Figure callout: Choose execution mode]

Isolated Processor Example: ListHDFS
▪ ListHDFS must be run on primary node only
 
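[Figure: a two-Processor flow with callouts]
─ First Processor (ListHDFS): gets a list of files in an HDFS directory; runs
on primary node
─ Second Processor: pulls individual HDFS files from the list; runs on all
nodes, and each node pulls different files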

Connection Load Balancing
▪ Configure Connections to balance load across the nodes of a cluster to
improve throughput
─ Can use different balancing strategies
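Two of the balancing strategies, round robin and partition by attribute, can be modeled in a few lines of Python. This is a conceptual sketch with hypothetical node names and attributes. In particular, the CRC-based node choice is an assumption made for illustration, not the hashing NiFi uses.

```python
import zlib
from itertools import cycle
from typing import Dict, List


def round_robin(flowfiles: List[dict], nodes: List[str]) -> Dict[str, List[dict]]:
    """Hand FlowFiles to nodes in turn so each node gets a similar count."""
    assignment: Dict[str, List[dict]] = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, cycle(nodes)):
        assignment[node].append(flowfile)
    return assignment


def partition_by_attribute(
    flowfiles: List[dict], nodes: List[str], attribute: str
) -> Dict[str, List[dict]]:
    """Send FlowFiles with the same attribute value to the same node."""
    assignment: Dict[str, List[dict]] = {node: [] for node in nodes}
    for flowfile in flowfiles:
        key = flowfile.get(attribute, "")
        # Illustrative hash only; any stable hash would serve the same purpose.
        index = zlib.crc32(key.encode("utf-8")) % len(nodes)
        assignment[nodes[index]].append(flowfile)
    return assignment


nodes = ["node1", "node2"]
flowfiles = [{"filename": f"file{i}.csv", "customer": c} for i, c in enumerate("ABABC")]

by_turn = round_robin(flowfiles, nodes)
by_customer = partition_by_attribute(flowfiles, nodes, "customer")
print({node: len(items) for node, items in by_turn.items()})
```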
 

Heartbeats (1)
▪ Nodes communicate their health and status to the cluster coordinator with
heartbeats
─ Tells the coordinator that they are still healthy and connected to the cluster
▪ By default, nodes emit heartbeats every five seconds
─ If a node does not heartbeat for 40 seconds, the coordinator disconnects the
node from the cluster
─ If the disconnected node sends a heartbeat later, the coordinator adds the
node back into the cluster
─ The coordinator also re-validates the node’s flow
─ Disconnection and reconnection events are reported in the UI
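The timeout rule can be expressed as a small Python model. It assumes the default values given above (a heartbeat every five seconds, disconnection after 40 seconds of silence) and uses hypothetical node names and timestamps. It sketches the rule only, not the coordinator's actual code.

```python
from typing import Dict

# Default silence tolerated before the coordinator disconnects a node.
# Nodes are expected to send a heartbeat every 5 seconds by default.
DISCONNECT_AFTER_SECONDS = 40


def classify_nodes(last_heartbeat: Dict[str, float], now: float) -> Dict[str, str]:
    """Label each node by how long ago the coordinator last heard from it."""
    statuses = {}
    for node, seen_at in last_heartbeat.items():
        silent_for = now - seen_at
        statuses[node] = (
            "CONNECTED" if silent_for < DISCONNECT_AFTER_SECONDS else "DISCONNECTED"
        )
    return statuses


# Hypothetical timestamps, in seconds.
last_heartbeat = {"node1": 100.0, "node2": 58.0}
statuses = classify_nodes(last_heartbeat, now=101.0)

# node2 sends a heartbeat again and is treated as connected once more.
last_heartbeat["node2"] = 102.0
statuses_after = classify_nodes(last_heartbeat, now=103.0)
print(statuses, statuses_after)
```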

Heartbeats (2)
▪ When a cluster coordinator is elected, it updates a well-known ZNode in
Apache ZooKeeper with its connection information
─ Other nodes use this to know where to send heartbeats
▪ Other nodes will not automatically pick up a disconnected node’s processes

Cluster Management UI (1)
▪ On the main canvas page, view the number of nodes in the cluster
─ Shows the currently connected count and the total count

[Figure callouts: Number of connected nodes; Indicates node is part of a cluster; Cluster management]

Cluster Management UI (2)
▪ The NiFi Cluster window lets you manage the cluster
─ View cluster node details, disconnect and remove nodes, and so on
 

[Figure callouts: Connected node; Primary and coordinator node; Disconnected node; Nodes in cluster]

Removing Nodes
▪ When any node is disconnected, no changes can be made until the node
reconnects
─ Remove the node from the cluster entirely to allow continued editing
─ The node can rejoin the cluster when the node has been restarted

Essential Points
▪ Apache NiFi executes within a JVM running on a host
▪ The primary components are a web server, flow controller, and extensions
such as custom Processors and plugins
▪ The primary storage components are the FlowFile repository, content
repository, provenance repository and the flow.xml.gz file
▪ NiFi uses a cluster architecture to help share the workload across multiple NiFi
instances
▪ The NiFi UI shows details about the connection status of the cluster and lets
you manage the cluster connection

Dataflow Optimization
Chapter 13
Chapter Topics

Dataflow Optimization
▪ Dataflow Optimization
▪ Control Rate
▪ Managing Compute
▪ Hands-On Exercise: Building an Optimized Dataflow
▪ Essential Points

Dataflow Optimization
After completing this chapter, you will be able to
▪ Explain why controlling dataflow is necessary
▪ Improve NiFi performance by reducing Processors and combining connections
in a dataflow
▪ Manage FlowFile backlog using ControlRate Processors
▪ Describe the tools NiFi uses to manage compute resources to balance your
system

What is Dataflow Optimization?
▪ Dataflow optimization is not an exact science with precise rules
▪ Requires a balance among
─ System resources (memory, network, disk space, disk speed, and CPU)
─ Number and sizes of files
─ Types of Processors used
─ Size of dataflow
─ NiFi system configuration

Minimize Number of Processors
▪ Use as few Processors as possible
▪ Use one Processor instead of many of the same type
─ Use the attribute values to separate the data when different processing is
required
▪ Group common functionality into Process Groups where it makes sense

Example: Reducing a Complex Flow (1)
▪ Example: A complex flow with multiple Processors of the same type to
decompress and push to HDFS
 
[Figure: four parallel branches, one per source (Pull data from Kafka, Pull
data from X, Pull data from Y, Pull HTTP data), each repeating the same steps:
Check if file is compressed, Decompress, Push file to HDFS]

Example: Reducing a Complex Flow (2)
▪ Route all FlowFiles to the same decompression and HDFS storage Processors
 

Using Funnels
▪ Funnels combine many connections into a single connection
▪ Useful for prioritizing and batching FlowFiles across multiple queues
▪ Can also improve readability
 

Restrict the Rate of Flow
▪ The ControlRate Processor ensures that the backlog of FlowFiles will not
overwhelm Processors further down the flow path
▪ Can also be combined with back pressure when needed

Configuring the ControlRate Processor
▪ Rate Control Criteria options: data rate, flowfile count,
attribute value
▪ Example: limit rate to ~50,000 bytes/minute
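To see what a data-rate limit of roughly 50,000 bytes per minute means for a queue of FlowFiles, consider the simplified Python model below. It groups FlowFile sizes into consecutive one-minute windows and lets the FlowFile that crosses the limit through, which is one reason the limit is approximate. This is an assumed model for illustration, not the ControlRate implementation, and the sizes are made up.

```python
from typing import List


def throttle(sizes: List[int], max_bytes_per_window: int) -> List[List[int]]:
    """Group FlowFile sizes into the time windows in which they would pass."""
    windows: List[List[int]] = []
    used = 0
    for size in sizes:
        # Open a new window at the start, or once the current budget is spent.
        if not windows or used >= max_bytes_per_window:
            windows.append([])
            used = 0
        windows[-1].append(size)
        used += size
    return windows


# Five hypothetical FlowFiles of 20,000 bytes each, limit 50,000 bytes per minute.
windows = throttle([20_000] * 5, max_bytes_per_window=50_000)
for minute, batch in enumerate(windows, start=1):
    print(f"minute {minute}: {len(batch)} FlowFiles, {sum(batch)} bytes")
```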
 

Copyright © 2010–2021 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-12
Chapter Topics

Dataflow Optimization
▪ Dataflow Optimization
▪ Control Rate
▪ Managing Compute
▪ Hands-On Exercise: Building an Optimized Dataflow
▪ Essential Points

Copyright © 2010–2021 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 13-13
Concurrent Tasks
▪ Set Concurrent Tasks to control how many threads a Processor runs
▪ Increases the number of FlowFiles processed in parallel
─ At the cost of using more system resources
 

Balance Latency and Throughput
▪ Some Processors allow you to configure run duration to balance latency
against throughput
 

Understanding What Resources a Processor Uses
▪ It is important to understand the resources needed by each Processor
▪ Example: the CompressContent Processor uses one CPU per concurrent
task
─ So four files in the queue and four concurrent tasks means four CPUs in use
─ Reduce bottlenecks by separating files by size
─ Configure small files with two threads, medium and large files with one
thread
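The relationship between concurrent tasks and CPU use can be illustrated with a Python thread pool, where `concurrent_tasks` plays the role of the Processor setting. This is an analogy with made-up data rather than NiFi code: each worker thread compresses one item at a time, so up to `concurrent_tasks` items are processed in parallel.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def compress_all(contents: List[bytes], concurrent_tasks: int) -> List[bytes]:
    """Compress every item using at most `concurrent_tasks` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrent_tasks) as pool:
        # map() preserves input order in its results.
        return list(pool.map(zlib.compress, contents))


# Four hypothetical FlowFile contents waiting in the queue.
queue = [bytes([i]) * 100_000 for i in range(4)]

# Four concurrent tasks: all four items can be compressed at the same time.
compressed = compress_all(queue, concurrent_tasks=4)
print([len(c) for c in compressed])
```

Lowering `concurrent_tasks` to one processes the same items one after another, trading throughput for lower CPU use.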
 

Processor Status
▪ The Processor status can help identify trouble spots
─ Use the read/write and tasks/time statistics to find “hot spots”
▪ If there are many tasks but the amount of data traversing the Processor is
low, the Processor might be configured to run too often or with too many
concurrent tasks
▪ Few completed tasks along with high task time indicates that this Processor is
CPU intensive
▪ If the dataflow volume is high with a high number of completed threads and
task time, improve performance by increasing the run duration
 

Managing Backlog
▪ If there is a connection in the flow where FlowFiles are always backlogged,
processing delay may be unacceptable
▪ However, adding more concurrent tasks to the bottlenecked Processor can
lead to thread starvation in another part of the flow
▪ Identify the source of the problem
─ The Processor might be very CPU intensive
─ The files might be very large leading to expensive reads and writes for each
file
▪ If resources are not an issue, try adding more concurrent tasks
▪ If resources are an issue, you might need to redesign the flow or spread the
workload across a cluster

Hands-On Exercise: Building an Optimized Dataflow
▪ In this exercise, you will build an optimized dataflow
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ To optimize your dataflow in Apache NiFi, you must balance
─ System resources
─ Number and sizes of files
─ Types of Processors used
─ Size of dataflow
─ NiFi system configuration
▪ Some popular techniques include
─ Minimize the number of Processors
─ Use Funnels to combine many connections
─ Restrict the rate of flow
─ Manage concurrent tasks
─ Balance latency and throughput

Site-to-Site Dataflows
Chapter 14

Chapter Topics

Site-to-Site Dataflows
▪ Site-to-Site Theory
▪ Site-to-Site Architecture
▪ Anatomy of a Remote Process Group
▪ Adding and Configuring Remote Process Groups
▪ Hands-On Exercise: Building Site-to-Site Dataflows
▪ Essential Points

Site-to-Site Dataflows
After completing this chapter, you will be able to
▪ Describe the advantages of using Apache NiFi across multiple sites
▪ Create and configure a Remote Process Group using the NiFi interface

NiFi Site-to-Site Processing
▪ Provides direct communication between two NiFi instances
▪ Allows dataflows to push data to and receive data from dataflows on remote
systems
▪ Communicates between clusters, standalone instances, or both
▪ Handles load balancing and reliable delivery
▪ Site-to-Site optionally supports secure connections using certificates
 

Site-to-Site Clients
 
[Figure: a Java program using the Site-to-Site client connects to the output ports on Nodes 1, 2, and 3 of a NiFi cluster]

Benefits of NiFi Site-to-Site (1)
▪ Easy to configure—Remote ports are automatically discovered
▪ Secure—Provides authentication and authorization, and optionally supports
encryption
▪ Scalable—Changes in remote cluster are automatically detected
▪ Efficient—FlowFiles can be sent in batches to reduce overhead
▪ Reliable—Sender and receiver compare checksums after data is transmitted
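The checksum comparison behind the "Reliable" point can be sketched as follows. This is a simplified simulation, not the Site-to-Site wire protocol: the use of CRC32 and the function names `batch_checksum` and `transfer` are assumptions made for illustration.

```python
import zlib


def batch_checksum(flowfiles: list[bytes]) -> int:
    """Return a running CRC32 over every payload in the batch."""
    crc = 0
    for payload in flowfiles:
        crc = zlib.crc32(payload, crc)
    return crc


def transfer(batch: list[bytes], corrupt: bool = False) -> bool:
    """Simulate sending one batch and report whether it can be committed.

    When corrupt is True, the last byte of each non-empty payload is
    overwritten to mimic damage in transit.
    """
    sent_crc = batch_checksum(batch)
    received = [p[:-1] + b"?" if corrupt and p else p for p in batch]
    received_crc = batch_checksum(received)
    # Both sides agree on the checksum only if the data arrived intact.
    return sent_crc == received_crc
```

If the two checksums differ, the sketch reports failure, which corresponds to the sender keeping the data and trying the batch again.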

Benefits of NiFi Site-to-Site (2)
▪ Automatic load balancing—Amount of data directed to a node automatically
adjusted based on the node’s load
▪ Protocol matching—Remote and local nodes negotiate which protocol and
version will be used
▪ Attributes—When a FlowFile is transferred, all attributes are automatically
transferred with it
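One way to picture automatic load balancing is a weighted split that favors less busy nodes. The sketch below is an assumption-laden illustration: the `distribute` function and its weighting formula are hypothetical and do not reproduce NiFi's actual algorithm.

```python
def distribute(batch_count: int, queued_by_node: dict[str, int]) -> dict[str, int]:
    """Split batch_count batches across nodes, favoring less-loaded nodes.

    The weight for a node is 1 / (1 + queued FlowFiles). This formula is
    a hypothetical stand-in, not the rule NiFi uses.
    """
    weights = {node: 1.0 / (1 + queued) for node, queued in queued_by_node.items()}
    total = sum(weights.values())
    shares = {node: int(batch_count * w / total) for node, w in weights.items()}
    # Hand any rounding remainder to the least-loaded nodes first.
    remainder = batch_count - sum(shares.values())
    for node in sorted(queued_by_node, key=queued_by_node.get)[:remainder]:
        shares[node] += 1
    return shares
```

With equal load the batches are split evenly; a node with a smaller queue receives a larger share.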

Site-To-Site: One to One
▪ Source connects Remote Process Group (RPG) to input port on destination
 
 
[Figure: an RPG on the source standalone NiFi connects to an input port on the target standalone NiFi]

Site-To-Site: Push to Cluster
▪ Source connects Remote Process Group to input port on destination
▪ NiFi takes care of load balancing across the nodes in the cluster
 
[Figure: an RPG on the source standalone NiFi connects to the input port on Nodes 1, 2, and 3 of the target cluster]

Site-To-Site: Pull from Source
▪ Source connects Remote Process Group to output port on the target
▪ Each node will pull different data from the target
 
[Figure: the RPG on each of Nodes 1, 2, and 3 in the source cluster connects to the output port on the target standalone NiFi]

Site-To-Site: Cluster to Cluster
▪ If both source and target are clusters, each source node’s RPG will pull from
the output port on each node in target cluster
 
[Figure: the RPG on each of Nodes 1, 2, and 3 in the source cluster connects to the output ports on Nodes 1, 2, and 3 of the target cluster]

Anatomy of a Remote Process Group (1)
▪ It is sometimes necessary to transfer data in a dataflow from one instance of
NiFi to another
▪ For this reason, NiFi provides the concept of a Remote Process Group (RPG)
▪ Remote Process Groups look similar to regular Process Groups in the UI
▪ The information rendered about an RPG is related to the interaction that occurs
between this instance of NiFi and the remote instance
 
[Figure: a Remote Process Group with callouts for Transmission Status, Remote Instance Name, Remote Instance URL, Secure Indicator, 5-Minute Statistics, and Last Refresh Time]

Anatomy of a Remote Process Group (2)
▪ Transmission Status
─ The icon indicates whether data transmission between this instance of NiFi
and the remote instance is enabled

─ Enabled if any of the input or output ports are configured to transmit
─ Disabled if all of the input and output ports that are currently
connected are stopped
▪ Remote Instance Name
─ The name of the NiFi instance that was reported by the remote instance
─ When the RPG is first created, before this information has been obtained,
the URL of the remote instance will be shown instead
 
[Figure: RPG callouts for Remote Instance Name and Transmission Status]

Anatomy of a Remote Process Group (3)
▪ Secure Indicator
─ This icon indicates whether or not communications with the remote NiFi
instance are secure
─ If communications with the remote instance are secure, this will be indicated
by a “locked” icon
─ Administrator for the remote instance must configure authorization
access policies for the source NiFi node(s)
─ If the communications are not secure, this will be indicated by an “unlocked”
icon
 

[Figure: RPG callout for Secure Indicator]

Anatomy of a Remote Process Group (4)
▪ Remote Instance URL
─ This is the URL of the remote NiFi instance that the Remote Process Group
points to
─ When target is a NiFi cluster, you can specify URLs for multiple nodes as a
comma separated list
▪ 5-Minute Statistics
─ Sent and Received
▪ Last Refreshed Time
─ The information shown for the Remote Process Group is pulled from a
remote instance and is periodically refreshed in the background
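The comma-separated URL list for a cluster target can be checked before you paste it into the RPG dialog. The sketch below is a standalone helper, not part of NiFi. The function name `parse_remote_urls` and the example host names are hypothetical.

```python
from urllib.parse import urlparse


def parse_remote_urls(value: str) -> list[str]:
    """Split a comma-separated URL list and validate each entry."""
    urls = [part.strip() for part in value.split(",") if part.strip()]
    for url in urls:
        parsed = urlparse(url)
        # Each entry must be an http or https URL that names a host.
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            raise ValueError(f"not a usable NiFi URL: {url!r}")
    return urls


# Hypothetical two-node cluster.
nodes = parse_remote_urls(
    "https://node1.example.com:8443/nifi, https://node2.example.com:8443/nifi"
)
```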
 
[Figure: RPG callouts for Remote Instance URL, 5-Minute Statistics, and Last Refresh Time]

Adding Remote Process Groups
▪ Add a Remote Process Group to communicate with a remote NiFi instance
▪ Configure with the URL of the remote instance of NiFi
─ The URL is the same one you would use to go to the remote instance’s UI
─ If remote instance is a cluster, you can specify any node’s URL
 
1. Drag the Remote Process Group onto your canvas
2. Enter the URL of the remote NiFi instance

Configuring Remote Process Group
▪ Right-click the RPG to enable or disable transmission or manage connected
ports
 

[Figure: RPG context menu with callouts: enable or disable transmission; number of RPGs transmitting and not transmitting; view, connect, and disconnect remote ports]

Manage Remote Ports
▪ Use the Manage Remote Ports menu item to view all input and output ports on
the remote NiFi instance
─ May take a minute to detect all ports
▪ Choose which port on the remote host to connect to
 

[Figure: remote ports list showing a connected input port, a disconnected input port, and a disconnected output port]

Hands-On Exercise: Building Site-to-Site Dataflows
▪ In this exercise, you will create a remote process group to interact with a flow
on a separate NiFi instance
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ NiFi Site-to-Site dataflows provide direct communication between two NiFi
instances
▪ Using Site-to-Site, dataflows can push data to and receive data from dataflows
on remote systems
▪ Site-to-Site communications in NiFi are based on two components: Input Ports
and Remote Process Groups (RPGs)

Cloudera Edge Management and
MiNiFi
Chapter 15

Chapter Topics

Cloudera Edge Management and MiNiFi


▪ Overview of MiNiFi
▪ Example Walk-through
▪ Essential Points

Cloudera Edge Management and MiNiFi
After completing this chapter, you will be able to
▪ Describe the differences between Apache NiFi and Apache MiNiFi
▪ Explain how NiFi and MiNiFi work together to gather data

Cloudera Edge Management
Cloudera Edge and Flow Management contains
▪ NiFi
▪ MiNiFi
─ Edge agents that transmit data from edge devices to NiFi
▪ Edge Flow Manager
─ Manages, controls, and monitors edge agents to collect data from edge
devices and push intelligence back to the edge
 

What Is MiNiFi?
▪ MiNiFi is focused on collecting data at the source
─ Sub-project of Apache NiFi
─ Small footprint
─ No UI
 
 

▪ NiFi
─ Lives in the data center
─ Runs on enterprise servers
▪ MiNiFi
─ Lives as close to the source of the data as possible
─ Agent runs as a guest on that device or system

Key MiNiFi Features
▪ Design and deploy dataflows using Edge Flow Manager
▪ Warm re-deploys of dataflows
▪ Guaranteed delivery of data to NiFi
▪ Data buffering (back pressure)
▪ Security and data provenance
▪ Fine-grained history of data
▪ Extensible

How Does MiNiFi Interact with NiFi?
▪ MiNiFi
─ Receives flows from NiFi
─ Collects data
─ Sends data for processing to NiFi
▪ NiFi
─ Runs flows to receive data
─ Aggregates data from many sources
─ Performs routing and processing

Enterprise MiNiFi Solutions
 

Cloudera Edge Flow Manager UI (EFM)
▪ Provides a central location to manage and monitor MiNiFi
▪ Use EFM Flow Designer to create flows
─ Similar to NiFi canvas
 

Flow Authorship and Control
1. User designs flow in Edge Flow Manager (EFM) UI
2. User saves completed flow to NiFi Registry
3. EFM pushes flows to configured MiNiFi agents
Example: EFM pushes Class A flow to only Class A devices
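The class-based targeting in the example can be sketched as a simple filter. This is an illustration only: the `agents_to_update` function, the agent IDs, and the class names are hypothetical and do not reflect the Edge Flow Manager API.

```python
def agents_to_update(agents: dict[str, str], flow_class: str) -> list[str]:
    """Return the IDs of agents whose class matches the published flow's class.

    agents maps a hypothetical agent ID to its agent class name.
    """
    return sorted(agent_id for agent_id, cls in agents.items() if cls == flow_class)


# Hypothetical fleet: two Class A devices and one Class B device.
fleet = {"sensor-01": "Class A", "sensor-02": "Class B", "sensor-03": "Class A"}
targets = agents_to_update(fleet, "Class A")
```

Only the agents registered under the matching class receive the new flow version; agents in other classes keep the flow they already run.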
 

Example: Sensor Readings
▪ The next several slides walk through an example MiNiFi flow
▪ MiNiFi agents use a lightweight messaging protocol called MQTT to collect
sensor readings from various devices
▪ Flow on MiNiFi agents sends data to central NiFi cluster
 

* See full example on GitHub


1. MiNiFi UI: Add a ConsumeMQTT Processor to MiNiFi Flow
▪ In the Edge Flow Manager UI, add a ConsumeMQTT Processor to receive sensor data
from devices
 

2. MiNiFi UI: Add a Remote Process Group to MiNiFi Flow
▪ Add a Remote Process Group to send data to NiFi
▪ Configure it with the NiFi URL
 

3. NiFi UI: Add Input Port to NiFi Flow
▪ In the NiFi UI, add an input port to accept data from MiNiFi
 

4. NiFi UI: Add a Funnel to the NiFi Flow
▪ Add a funnel to the NiFi flow to terminate the input port’s relationships for
testing
─ You can later connect to another Processor to send to Kafka
 

5. NiFi UI: Find ID of Input Port
▪ Find the ID of the NiFi input port to connect to the MiNiFi output port in the
next step
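If you prefer to look up the port ID from a script instead of the UI, the lookup itself is a small search by name. The sketch below works on an already-loaded dictionary. The payload shape (`controller` containing `inputPorts` with `id` and `name`), the sample ID, and the port name are assumptions for illustration and should be checked against the REST API documentation for your NiFi version.

```python
def find_input_port_id(site_to_site: dict, port_name: str) -> str:
    """Return the ID of the single input port with the given name."""
    ports = site_to_site.get("controller", {}).get("inputPorts", [])
    matches = [port["id"] for port in ports if port.get("name") == port_name]
    if len(matches) != 1:
        raise LookupError(f"expected one port named {port_name!r}, found {len(matches)}")
    return matches[0]


# Hypothetical payload; the ID below is made up.
sample = {
    "controller": {
        "inputPorts": [
            {"id": "0a1b2c3d-0000-1000-8000-000000000001", "name": "from-minifi"},
        ]
    }
}
port_id = find_input_port_id(sample, "from-minifi")
```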
 

6. MiNiFi UI: Connect the MiNiFi Flow to the NiFi Flow
▪ Connect the ConsumeMQTT Processor to the NiFi flow
─ Configure the ConsumeMQTT destination port ID to the NiFi flow’s input
port ID
 

7. NiFi Registry UI: Create a Bucket for MiNiFi Flow
▪ Before publishing the MiNiFi flow, you must create a NiFi Registry bucket
 

8. MiNiFi UI: Publish First Version of MiNiFi Flow
▪ Publish the flow
─ Add to the bucket created in the previous step
▪ Will automatically deploy to MiNiFi agents
 

9. NiFi Registry UI: Confirm MiNiFi Flow in Bucket
▪ The MiNiFi flow is now being versioned by NiFi Registry
 

Essential Points
▪ MiNiFi is a subproject of Apache NiFi that is focused on the collection of data
at the source of its creation
▪ Edge Flow Manager manages, controls, and monitors edge agents to collect data
from edge devices and push intelligence back to the edge
▪ MiNiFi can pass data from an edge device (for example, a sensor) to a central
NiFi cluster using MQTT and an Input port
▪ You can build and deploy MiNiFi flows using Cloudera Edge Flow Manager

Monitoring and Reporting
Chapter 16

Chapter Topics

Monitoring and Reporting


▪ Monitoring from NiFi
▪ Overview of Reporting
▪ Examples of Common Reporting Tasks
▪ Hands-On Exercise: Monitoring and Reporting
▪ Essential Points

Monitoring and Reporting
After completing this chapter, you will be able to
▪ Interpret the Apache NiFi interface elements used to monitor performance
▪ Use reporting tasks to manage how the system performance is reported
▪ Use bulletins and the summary page to identify issues with Processors

Monitoring NiFi
▪ Monitor system performance to assess resource usage and health
▪ Monitoring mechanisms include
─ Status bar
─ Component statistics
─ Bulletins
─ Summary page

Monitoring Features of the Status Bar
▪ Active threads
▪ Total size and number of FlowFiles currently queued
▪ Transmitting RPG
▪ Non-transmitting RPG
▪ Running components
▪ Stopped components
▪ Invalid components
▪ Disabled components
▪ Up to date Versioned Process Groups
▪ Locally modified Versioned Process Groups
▪ Stale Versioned Process Groups
▪ Locally modified and stale Versioned Process Groups
▪ Sync failure Versioned Process Groups

Component Statistics
▪ Most Processors, Process Groups, and Remote Process Group surface panels
show statistics for the previous five minutes
─ Number of FlowFiles in and out
─ Amount of content data read and written to disk
▪ Processors will display the number of active threads
▪ Connections show the number of items currently in the queue
 
[Figure: a Processor and a connection with callouts for active threads, 5-minute statistics, and connection queue]

Status History
▪ Shows statistics for the last 24 hours
▪ Provides additional metrics such as average and total task duration, lineage
duration, and FlowFiles removed
▪ Can zoom in on specific ranges

Bulletins (1)
▪ Bulletins show problems that have occurred with a Processor or Process Group
─ Warnings or errors in the last five minutes will trigger a bulletin indicator
▪ Bulletins indicate which node in the cluster emitted the bulletin if applicable
▪ You can change the log level for bulletins in the component’s settings
configuration tab
 

Bulletins (2)
▪ View the bulletin board for a history of all bulletins
─ Search and filter by message, or by Processor or Process Group name or ID
─ Click on the UUID to go to the component on the canvas
 

The Summary Page
▪ The NiFi Summary page shows current status and 5-minute statistics for all
components
─ Such as Processors, Process Groups, Connections, and so on
▪ Access the summary page using the global menu
▪ Page contains similar information to the component surface panels
▪ Includes links to individual components, component details, and statistics
 

[Figure: Summary page with callouts: choose component type; open summary in a separate browser window; search by attribute value; open component details and configuration panel; links to component, statistics, and cluster information]

Reporting Tasks (1)
▪ Reporting tasks run in the background to provide reports about what is
happening in NiFi
▪ There are many reporting tasks including
─ MonitorDiskUsage
─ MonitorMemory
─ SiteToSiteStatusReportingTask
─ SiteToSiteBulletinReportingTask
─ SiteToSiteProvenanceReportingTask
─ ControllerStatusReportingTask
─ ScriptedReportingTask
▪ Add and configure reporting tasks in the NiFi Settings window

Reporting Tasks (2)
▪ In the global menu, select Controller Settings
▪ Use the REPORTING TASKS tab to
─ Add, configure, enable, disable, and remove reporting tasks
─ View reports
 

Monitoring Disk Usage
▪ Use MonitorDiskUsage task to check storage space used by the content
and FlowFile repositories
▪ Warns with log message and a system-level bulletin when disk usage exceeds
threshold
▪ Properties
─ Threshold—task will report when usage exceeds specified percentage
(default = 80%)
─ Directory Location—directory path to monitor (required, no default)
─ Directory Display Name—name to display in alerts (optional, default = Un-
Named)
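The threshold check that MonitorDiskUsage performs can be approximated outside NiFi with the Python standard library. This is a minimal sketch, not the reporting task's implementation: the function names and the message format are hypothetical, while the 80% default and the Un-Named display name mirror the properties listed above.

```python
import shutil
from typing import Optional


def percent_used(total: int, used: int) -> float:
    """Return disk usage as a percentage of total capacity."""
    return 100.0 * used / total if total else 0.0


def check_disk(path: str, threshold: float = 80.0,
               display_name: str = "Un-Named") -> Optional[str]:
    """Return a warning message if usage of path exceeds threshold, else None."""
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.total, usage.used)
    if pct > threshold:
        return (f"WARNING: {display_name} ({path}) is {pct:.1f}% full, "
                f"threshold is {threshold:.0f}%")
    return None


# Check the current directory against the default threshold.
message = check_disk(".")
```

In NiFi you would point Directory Location at a repository directory, such as the content repository, rather than the current directory.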

Monitoring JVM Memory
▪ MonitorMemory task tracks Java heap memory available for a particular
JVM memory pool
▪ Generates a WARNING log message and a system-level bulletin when usage
exceeds threshold
▪ Properties:
─ Memory Pool (required, no default); allowable values include Code Cache,
Metaspace, Compressed Class Space, G1 Eden Space, G1 Survivor Space, and
G1 Old Gen
─ Usage Threshold (required, default = 65%)
─ Reporting Interval—how often task should check usage (optional, no default)

Reporting on Processors and Connections
▪ ControllerStatusReportingTask logs the 5-minute statistics shown
in the NiFi Summary Page
─ Optionally logs deltas between the previous and the current iterations
▪ Processor statistics include
─ Status
─ FlowFiles in and out
─ Bytes read and written to disk
─ Tasks completed
─ Processing time
▪ Connection statistics include
─ FlowFiles in and out
─ FlowFiles queued
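The optional delta logging can be pictured as a subtraction between two snapshots of the same counters. The sketch below is illustrative only: the `stat_deltas` function and the counter names are hypothetical and do not reproduce the reporting task's log format.

```python
def stat_deltas(previous: dict[str, int], current: dict[str, int]) -> dict[str, int]:
    """Return current minus previous for every counter in the current snapshot.

    Counters missing from the previous snapshot are treated as zero.
    """
    return {name: value - previous.get(name, 0) for name, value in current.items()}


# Hypothetical snapshots from two consecutive reporting iterations.
before = {"flowfiles_in": 1200, "flowfiles_out": 1180, "tasks": 300}
after = {"flowfiles_in": 1500, "flowfiles_out": 1490, "tasks": 360}
change = stat_deltas(before, after)
```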

Site-to-Site Reporting
▪ The Site-to-Site protocol provides the ability to report information from other
NiFi instances
─ Can use regex or comma-separated list to report from multiple NiFi instances
▪ Can report on
─ Status
─ Metrics
─ Bulletins
─ FlowFile provenance

Scripted Reporting
▪ ScriptedReportingTask allows you to call a script
─ You can write the script to perform additional reporting tasks
▪ Passes key information to the script such as events, provenance, bulletins, and
JVM metrics
▪ Properties
─ Script Engine—select the type of script, such as Python, Groovy, or Ruby
(required, default = Clojure)
─ Script File or Script Body—configure the task with either the code to execute
or a pointer to a file to execute (one of the two properties is required, no
default)
─ Module Directory—modules to include when running the script

Hands-On Exercise: Monitoring and Reporting
▪ In this exercise, you will explore NiFi’s monitoring and reporting features
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ NiFi has a number of features that monitor system performance to help you to
understand the resource usage and general health of the system
▪ Monitoring tools include
─ Canvas status bar
─ Component statistics
─ Bulletins
─ Summary page
▪ Use Reporting Tasks to create reports monitoring activities including
─ Disk usage
─ Memory usage
─ Bulletins
─ Controller status

Controller Services
Chapter 17

Copyright © 2010–2021 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 17-2
Chapter Topics

Controller Services
▪ Controller Services Overview
▪ Common Controller Services
▪ Hands-On Exercise: Adding Apache Hive Controller
▪ Essential Points

Controller Services
After completing this chapter, you will be able to
▪ Describe the advantages of using a Controller Service in Apache NiFi
▪ Configure a Controller Service and use it in a Processor

Controller Services
▪ A single location to configure shared services
─ Provides information to be used by reporting tasks, Processors, and other
components
─ Configure once, re-use wherever needed
▪ Useful for sensitive information, such as database names, database users,
and passwords
─ Allow tight restrictions on who can access controllers
─ Allow other data engineers to use the controller without gaining access to
authorization information
▪ Reduce the need for multiple connection strings
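
The configure-once, reuse-everywhere idea can be sketched in a few lines of Python. This is a conceptual model only, not NiFi code: the `ControllerService` and `Processor` classes, the service name, and the JDBC URL are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ControllerService:
    """Hypothetical stand-in for a shared service definition."""
    name: str
    url: str
    user: str
    _password: str = field(repr=False)  # excluded from repr so it is not displayed

    def connect(self) -> str:
        # A real service would open a connection; this sketch returns a label.
        return f"{self.user}@{self.url}"


@dataclass
class Processor:
    """Hypothetical component that refers to a service instead of copying settings."""
    name: str
    service: ControllerService

    def run(self) -> str:
        return f"{self.name} using {self.service.connect()}"


# Configure once...
hive = ControllerService("hive-pool", "jdbc:hive2://example-host:10000", "etl", "secret")
# ...and reuse wherever needed.
processors = [Processor("SelectOrders", hive), Processor("InsertOrders", hive)]
results = [p.run() for p in processors]
```

Both processors hold a reference to the same object, so a change to the connection settings is made in one place only.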

Controller Services Configuration
▪ Use the NiFi Settings CONTROLLER SERVICES tab to configure controllers
 

DBCPConnectionPool
▪ Database connection pooling service
▪ Connections can be requested from pool and returned after usage
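
The request-and-return behaviour of a connection pool can be illustrated with a minimal sketch. It is not how DBCPConnectionPool is implemented: the `ConnectionPool` class is hypothetical, and an in-memory `sqlite3` database stands in for a real JDBC data source.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Minimal fixed-size pool; sqlite3 stands in for a real database."""

    def __init__(self, size: int) -> None:
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(sqlite3.connect(":memory:", check_same_thread=False))

    @contextmanager
    def connection(self, timeout: float = 1.0):
        conn = self._idle.get(timeout=timeout)  # request a connection from the pool
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return it after use, even if an error occurred

    def idle_count(self) -> int:
        return self._idle.qsize()


pool = ConnectionPool(size=2)
with pool.connection() as conn:
    value = conn.execute("SELECT 1 + 1").fetchone()[0]
    idle_while_borrowed = pool.idle_count()
idle_after_return = pool.idle_count()
```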
 

HiveConnectionPool
▪ Provides connection pooling service for Apache Hive
 

HBaseClientService
▪ Provides a connection to Apache HBase
▪ NiFi includes implementations for supported versions of HBase

JMSConnectionFactoryProvider
▪ Provides the ability to connect to Java Message Service
─ A factory service for vendor-specific javax.jms.ConnectionFactory
implementations
 

AWSCredentialsProviderControllerService
▪ Defines credentials for Amazon Web Services Processors
 

Hands-On Exercise: Adding Apache Hive Controller
▪ In this exercise, you will add a controller service to provide access to Hive
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ Controller Services
─ Provide a single location to configure shared services
─ Can be used to share secure information like database names, database
users, and passwords
─ Reduce the need for multiple connection strings
▪ To create a new Controller Service, click on the Configuration icon on the
Operate palette
─ In the CONTROLLER SERVICES tab, click on + to add a new Controller Service

Integrating NiFi with the
Cloudera Ecosystem
Chapter 18
Chapter Topics

Integrating NiFi with the Cloudera Ecosystem


▪ NiFi Integration Architecture
▪ NiFi Ecosystem Processors
▪ A Closer Look at NiFi and Apache Hive
▪ A Closer Look at NiFi and Apache Kafka
▪ Hands-On Exercise: Integrating Dataflows with Kafka and HDFS
▪ Essential Points

Integrating NiFi with the Cloudera Ecosystem
After completing this chapter, you will be able to
▪ Describe the larger Cloudera architecture and where Apache NiFi sits
▪ List the NiFi Processors that can be used to connect to other Cloudera services
▪ Describe how NiFi works with Apache Hive and Apache Kafka

Cloudera Data Platform and Flow Management Together
▪ NiFi integrates seamlessly with Cloudera Data Platform (CDP)
 
[Figure: Cloudera Flow Management (powered by Apache NiFi) alongside Cloudera Data Platform. Labels shown: Internet of Anything, Store Data and Metadata, Enrich Context, Perishable Insights, Historical Insights]

NiFi in CDP
▪ NiFi is CDP’s data ingestion and dataflow solution
▪ Dynamically connects and conducts data into other CDP services
▪ Secures and encrypts data before sending
▪ Offers traceability on the data’s flow from the source, with lineage and audit
trails before it reached CDP
▪ Models flows graphically to dynamically adjust data coming to CDP
▪ Includes mature IoAT data protocols that improve device extensibility
▪ Manages IoAT flows bi-directionally with easy optimization and adjustment

NiFi Enterprise Architecture
[Figure: NiFi enterprise architecture. Components shown: Source Database, RDBMS, Kafka MQ, Kafka, Streaming Applications (Flink, Spark, Kafka Streams), ETL Action, Standalone NiFi (two instances), NiFi Data Bus, Hive Data Warehouse, Hive Interactive DB, HDFS, Disk Archive Server, Source Log Server]

NiFi and CDP Architecture
[Figure: NiFi and CDP architecture. Inputs shown: Raw Network Stream, Network Metadata Stream, Syslog, Raw Application Logs, Other Streaming Telemetry. Labels shown: Cloudera DataFlow, Hadoop, NiFi, Kafka Service, Flink, Spark Streaming, Streams Management / Workflow, Phoenix, Spark, Data Stores, HBase, Hive, Solr, YARN, HDFS, SIEM]

Cloudera Ecosystem Integration Processors
▪ NiFi includes a number of Processors so flows can work with other ecosystem
components, such as
─ Apache HDFS—Hadoop Distributed File System
─ Apache HBase—NoSQL data store built on HDFS
─ Apache Hive—SQL front-end for distributed data
─ Apache Kudu—column-oriented data store
─ Apache Solr—enterprise search
─ Apache Kafka—distributed event streaming platform
▪ It also includes Processors to support popular structured data formats
─ Parquet
─ JSON
─ Avro

NiFi and HDFS
▪ There are several HDFS-related Processors
▪ Specify the HDFS file/directory path to read/write
▪ Kerberos credentials are optional in general, but required when connecting
to a Kerberos-enabled environment
 

NiFi and Apache HBase
▪ Configure HBase Client Service and Distributed Cache Services
▪ Provide schema information, such as table names, to pull and push data
▪ You can use Phoenix JDBC to connect and query HBase with regular SQL
connectors
 

[Figure: NiFi exchanging data with HBase. Processors shown: PutHBaseCell, PutHBaseRecord, PutHBaseJSON, GetHBase, FetchHBaseRow]

What Is Apache Hive?
▪ Data warehouse infrastructure for Apache Hadoop
▪ Included with CDP
▪ Uses a SQL-like language called HiveQL
▪ Supports querying data in the Hadoop Distributed File System (HDFS) and
other storage back ends
 

NiFi and Apache Hive
▪ NiFi can retrieve data from and store data to Hive tables
▪ Configure a HiveConnectionPool controller to connect to Hive server
▪ Use SelectHiveQL to query Hive tables and provide output in Avro or CSV
format
▪ Use PutHiveQL to add data to a Hive table
▪ Use PutHiveStreaming to stream FlowFile content into Hive tables
 

Hive Processors—SelectHiveQL
▪ SelectHiveQL executes provided HiveQL SELECT query
─ Query results are returned as Avro or CSV format FlowFile content

Required Properties                        Default
Hive Database Connection Pooling Service   No default
Fetch Size                                 Ignored
Max Rows Per Flow File                     All rows returned in a single FlowFile
Maximum Number of Fragments                All fragments are returned
Output Format                              Avro
Normalize Table/Column Names               False
CSV Header                                 True
CSV Delimiter                              Comma
CSV Quote                                  True
CSV Escape                                 True
Character Set                              UTF-8
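
The effect of the Max Rows Per Flow File and CSV Header properties can be sketched as follows. This is an illustration of the behaviour described on the slide, not SelectHiveQL's implementation: the function names are hypothetical, and the sketch assumes that a `max_rows` value of 0 means "all rows in a single chunk".

```python
import csv
import io
from typing import Iterable, Iterator, List, Sequence


def rows_to_csv_chunks(
    header: Sequence[str],
    rows: Iterable[Sequence[object]],
    max_rows: int = 0,
    include_header: bool = True,
) -> Iterator[str]:
    """Yield CSV text chunks; max_rows=0 puts all rows into one chunk."""
    batch: List[Sequence[object]] = []
    for row in rows:
        batch.append(row)
        if max_rows and len(batch) == max_rows:
            yield _to_csv(header, batch, include_header)
            batch = []
    if batch:
        yield _to_csv(header, batch, include_header)


def _to_csv(header, batch, include_header) -> str:
    """Render one batch of rows as CSV text, optionally with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    if include_header:
        writer.writerow(header)
    writer.writerows(batch)
    return buf.getvalue()


rows = [(1, "a"), (2, "b"), (3, "c")]
single = list(rows_to_csv_chunks(["id", "name"], rows))
split = list(rows_to_csv_chunks(["id", "name"], rows, max_rows=2))
```

Each chunk in `split` corresponds to what would be the content of one FlowFile in this simplified model.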


Hive Processors—PutHiveQL
▪ PutHiveQL executes a HiveQL DDL/DML command such as UPDATE or
INSERT
─ The content of an incoming FlowFile is expected to be the HiveQL command
to execute

Required Properties                        Default
Hive Database Connection Pooling Service   No default
Batch Size                                 100
Character Set                              UTF-8
Statement Delimiter                        Semicolon
Rollback on Failure                        False
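
The role of the Statement Delimiter can be shown with a deliberately naive sketch. It assumes a FlowFile whose content holds several HiveQL statements; the `split_statements` helper and the `sales` table are hypothetical, and the real Processor's parsing may differ (for example, in how it treats delimiters inside quoted strings).

```python
from typing import List


def split_statements(content: str, delimiter: str = ";") -> List[str]:
    """Naive split on the delimiter; does not handle delimiters inside quotes."""
    return [part.strip() for part in content.split(delimiter) if part.strip()]


flowfile_content = """
INSERT INTO sales VALUES (1, 'widget');
UPDATE sales SET item = 'gadget' WHERE id = 1;
"""
statements = split_statements(flowfile_content)
```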

Hive Processors—PutHiveStreaming
▪ Uses Hive Streaming to send FlowFile data to an Apache Hive table
─ The incoming FlowFile is expected to be in Avro format, and the table must
exist in Hive

Required Properties        Default
Hive Metastore URI         No default
Database Name              No default
Table Name                 No default
Auto-Create Partitions     True
Max Open Connections       8
Heartbeat Interval         60 (seconds)
Transactions per Batch     100
Records per Transaction    1000
Call Timeout               Indefinitely
Rollback on Failure        False


Kafka Basics (1)
▪ Apache Kafka is a fast, scalable, distributed publish-subscribe messaging
system that provides
─ Durability by persisting data to disk
─ Fault tolerance through replication
▪ A message is a single data record passed by Kafka
▪ One or more brokers in a cluster receive, store, and distribute messages

Kafka Basics (2)
▪ A topic is a named feed or queue of messages
─ A Kafka cluster can include any number of topics
▪ Producers are programs that publish (send) messages to a topic
▪ Consumers are programs that subscribe to (receive messages from) a topic
▪ Consumer groups are related consumers that share responsibility for
processing messages on a particular topic
▪ Kafka allows a topic to be partitioned
─ Topic partitions are handled by different brokers for scalability
─ Note that topic partitions are not related to Apache Spark DataFrame or RDD partitions
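
How partitions and consumer groups divide work can be sketched in plain Python. This is a simplified model, not Kafka's actual partitioner or assignment strategy: it assumes a CRC32 hash for choosing a partition and a round-robin assignment, and the consumer and key names are hypothetical.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition from a stable hash of the key (CRC32, for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Round-robin assignment of partitions to the members of one consumer group."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


assignment = assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b"])
same_key_same_partition = partition_for("device-42", 4) == partition_for("device-42", 4)
```

Because the hash is stable, messages with the same key always land in the same partition, and each partition is read by exactly one member of the group.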

Kafka Basics (3)

Kafka Basics (4)
▪ Kafka divides each topic into partitions
─ Topic partitioning improves scalability and throughput
▪ A topic partition is an ordered sequence of messages
─ New messages are appended to the partition as they are received
─ Each message is assigned a unique sequential ID known as an offset
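
A partition's append-only behaviour and its offsets can be modelled with a short sketch. The `PartitionLog` class is hypothetical, and the `earliest` and `latest` reset names are assumptions chosen to mirror the idea of starting from the first or the latest message.

```python
from typing import List, Tuple


class PartitionLog:
    """Append-only list standing in for one topic partition."""

    def __init__(self) -> None:
        self._messages: List[bytes] = []

    def append(self, message: bytes) -> int:
        """Store the message and return its offset (its position in the log)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def start_offset(self, reset: str) -> int:
        """Where a consumer with no committed offset begins reading."""
        if reset == "earliest":
            return 0
        if reset == "latest":
            return len(self._messages)
        raise ValueError(f"unknown reset policy: {reset}")

    def read_from(self, offset: int) -> List[Tuple[int, bytes]]:
        """Return (offset, message) pairs from the given offset onward."""
        return list(enumerate(self._messages[offset:], start=offset))


log = PartitionLog()
offsets = [log.append(m) for m in (b"a", b"b", b"c")]
from_earliest = log.read_from(log.start_offset("earliest"))
from_latest = log.read_from(log.start_offset("latest"))
```

A consumer that starts from the latest position sees only messages appended after it begins reading, which is why `from_latest` is empty here.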
 

NiFi and Apache Kafka

[Figure: NiFi exchanging data with Kafka (Kafka Streams). Processors shown: PublishKafka, PublishKafkaRecord, ConsumeKafka, ConsumeKafkaRecord]

Kafka Processors: ConsumeKafka
▪ Pulls messages from one or more Kafka topics
▪ Adds message content to FlowFile content (claim)

Important Property  Detail                                                            Default
Kafka Brokers       Comma-separated list of hosts and ports                           localhost:9092
Security Protocol   PLAINTEXT, SSL, SASL_PLAINTEXT, or SASL_SSL                       PLAINTEXT
Topic Name(s)       A single topic, a comma-separated list, or a regular expression   No default
Topic Name Format   Whether the Topic Name(s) property is a list of names or a        name
                    regular expression
Group ID                                                                              No default
Offset Reset        Whether to start reading from the first or latest message         latest
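
The interaction between Topic Name(s) and Topic Name Format can be sketched as a small resolver. This is an illustration only: the `resolve_topics` function and the topic names are hypothetical, and the format values `names` and `pattern` are assumptions of this sketch.

```python
import re
from typing import List


def resolve_topics(topic_setting: str, name_format: str, available: List[str]) -> List[str]:
    """Return the topics to subscribe to, given a list of names or a regex pattern."""
    if name_format == "names":
        return [t.strip() for t in topic_setting.split(",") if t.strip()]
    if name_format == "pattern":
        regex = re.compile(topic_setting)
        return [t for t in available if regex.fullmatch(t)]
    raise ValueError(f"unknown topic name format: {name_format}")


available_topics = ["orders", "orders-eu", "payments"]
by_name = resolve_topics("orders, payments", "names", available_topics)
by_pattern = resolve_topics(r"orders.*", "pattern", available_topics)
```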

Kafka Processors: PublishKafka
▪ Sends messages to a Kafka topic with FlowFile content as the message content
▪ The messages may be individual FlowFiles or may be delimited
▪ Important properties

Important Property  Detail                                             Default
Kafka Brokers       Comma-separated list of hosts and ports            localhost:9092
Security Protocol   PLAINTEXT, SSL, SASL_PLAINTEXT, or SASL_SSL        PLAINTEXT
Topic Name                                                             No default
Delivery Guarantee  Best Effort, Guarantee Single Node Delivery, or    Best Effort
                    Guarantee Replicated Delivery
Compression Type    gzip, snappy, lz4, or none                         No default
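
Two of the ideas above, delimited messages and delivery guarantees, can be sketched together. The `to_messages` helper is hypothetical, and the mapping from the three Delivery Guarantee options to Kafka producer `acks` values is stated here as an assumption rather than taken from the slide.

```python
from typing import Dict, List, Optional

# Assumed correspondence between the listed options and Kafka producer `acks` values.
DELIVERY_GUARANTEE_TO_ACKS: Dict[str, str] = {
    "Best Effort": "0",
    "Guarantee Single Node Delivery": "1",
    "Guarantee Replicated Delivery": "all",
}


def to_messages(content: bytes, demarcator: Optional[bytes] = None) -> List[bytes]:
    """One message per FlowFile, or one per delimited segment if a demarcator is given."""
    if demarcator is None:
        return [content]
    return [part for part in content.split(demarcator) if part]


whole = to_messages(b"line1\nline2\n")
split = to_messages(b"line1\nline2\n", demarcator=b"\n")
acks = DELIVERY_GUARANTEE_TO_ACKS["Guarantee Replicated Delivery"]
```

Stronger guarantees trade throughput for safety: the producer waits for more acknowledgements before treating a message as sent.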

Hands-On Exercise: Integrating Dataflows with Kafka and
HDFS
▪ In this exercise, you will create dataflows with processors that interact with
Kafka and HDFS
▪ Please refer to the Hands-On Exercise Manual for instructions

Essential Points
▪ NiFi integrates with the Cloudera Enterprise Data Hub and the rest of the
Cloudera ecosystem
▪ Cloudera provides a number of integration processors so flows can work with
other ecosystem components
▪ Popular integration processors include
─ HDFS
─ Apache Hive
─ Apache Kafka

NiFi Security
Chapter 19
Chapter Topics

NiFi Security
▪ NiFi Security Overview
▪ Securing Access to the NiFi UI
▪ Authentication
▪ Authorization
▪ NiFi Registry Security
▪ NiFi Security Summary
▪ Essential Points

NiFi Security
After completing this chapter, you will be able to
▪ List the five elements of Apache NiFi security
▪ Describe how to control access to NiFi
▪ Describe how to control access to NiFi Registry

NiFi Security
▪ Security is essential in a production NiFi installation
▪ NiFi provides several elements to control who can access data and perform
actions
─ Administration—central management and consistent security
─ Authentication—confirm the identity of users and systems
─ Authorization—determine whether a participant is allowed to perform an
action
─ Auditing—maintain a record of data access
─ Data Protection—prevent unauthorized access to data at rest and in motion

NiFi Security Features

▪ Administration
─ Automatic NiFi cluster coordinator and primary node election with
Apache ZooKeeper
─ Multiple entry points
▪ Authentication
─ 2-Way TLS/SSL support out of the box
─ Supports LDAP and Kerberos integration
▪ Authorization
─ Multi-tenant authorization
─ File-based authority provider: global and component-level access
policies
▪ Auditing
─ Data provenance
─ Detailed logging of all user actions and key system behaviors
▪ Data Protection
─ Supports a variety of SSL/encrypted protocols
─ Tag and utilize tags on data for fine-grained access controls
─ Encrypt/decrypt content using pre-shared key mechanisms
─ Passwords in configuration files are encrypted

TLS/SSL Authentication
▪ NiFi provides two-way (mutual) TLS/SSL authentication
▪ Two parties (the NiFi server and the web browser or client) authenticate each
other
▪ Both verify the public key certificate/digital certificate issued by the trusted
Certificate Authorities (CAs)
─ A CA is a trusted entity that issues certificates; examples include Verisign
and Microsoft Certificate Server
▪ TLS/SSL must be enabled before enabling other NiFi security features
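
The difference between one-way and two-way TLS comes down to whether the server also demands a certificate from the client. A minimal sketch using Python's standard `ssl` module is shown here for illustration; NiFi itself is configured through Java keystores and truststores, and the certificate file names in the comments are hypothetical.

```python
import ssl

# Server side: present a certificate and REQUIRE one from the client (two-way TLS).
server_ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
server_ctx.verify_mode = ssl.CERT_REQUIRED
# In a real deployment (hypothetical file names):
# server_ctx.load_cert_chain("server-cert.pem", "server-key.pem")
# server_ctx.load_verify_locations("trusted-ca.pem")

# Client side: verify the server's certificate and be ready to present its own.
client_ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
# client_ctx.load_cert_chain("client-cert.pem", "client-key.pem")
# client_ctx.load_verify_locations("trusted-ca.pem")
```

With `verify_mode` left at its server-side default, the handshake would authenticate only the server, which is the one-way case.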

How Two-way TLS/SSL Works

LDAP Authentication
▪ NiFi supports user authentication using
─ Client certificates, using SSL
─ Username and password, using a Login Identity Provider (a pluggable
mechanism for authenticating users, such as LDAP)

What Is Kerberos?
▪ Kerberos is a widely used protocol for network authentication
▪ Kerberos authenticates users and network nodes

Kerberos Exchange Participants (1)
▪ Kerberos involves messages exchanged among three parties
─ The client
─ The server providing a desired network service
─ The Kerberos Key Distribution Center (KDC)

Kerberos Exchange Participants (2)

[Figure: the three participants: Client, Kerberos KDC (Key Distribution Center), and Desired Network Service (Protected by Kerberos)]

▪ The client is software that desires access to a service


─ Such as a web browser visiting the NiFi UI

Kerberos Exchange Participants (3)

[Figure: the three participants: Client, Kerberos KDC (Key Distribution Center), and Desired Network Service (Protected by Kerberos)]

▪ The desired network service is the service (such as NiFi) that the client wishes to access

Kerberos Exchange Participants (4)

[Figure: the three participants: Client, Kerberos KDC (Key Distribution Center), and Desired Network Service (Protected by Kerberos)]

▪ The Kerberos server (KDC) authenticates clients

Kerberos Concepts (1)

[Figure: exchange among Client, Kerberos KDC (Key Distribution Center), and Desired Network Service (Protected by Kerberos)]
1. Client sends username and requests authentication from the KDC
2. KDC authenticates the client and returns a service ticket
3. Client uses the service ticket to request service

▪ Client requests authentication for a user principal


▪ Kerberos authenticates the user and returns a service ticket
▪ Client connects to a service and passes the service ticket
─ Services protected by Kerberos do not directly authenticate the client
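
The three-step exchange can be mimicked with a toy model. This is emphatically not the Kerberos protocol (which also involves ticket-granting tickets, encryption, and timestamps): the `ToyKDC` and `ToyService` classes are hypothetical, and an HMAC signature stands in for a ticket that only the KDC and the service can produce.

```python
import hashlib
import hmac
from typing import Dict, Set, Tuple

Ticket = Tuple[str, str, str]  # (principal, service name, signature)


def _sign(key: bytes, principal: str, service: str) -> str:
    """Sign the (principal, service) pair with a key shared by KDC and service."""
    return hmac.new(key, f"{principal}|{service}".encode(), hashlib.sha256).hexdigest()


class ToyKDC:
    def __init__(self) -> None:
        self._service_keys: Dict[str, bytes] = {}
        self._principals: Set[str] = set()

    def register_service(self, service: str, key: bytes) -> None:
        self._service_keys[service] = key

    def register_principal(self, principal: str) -> None:
        self._principals.add(principal)

    def issue_ticket(self, principal: str, service: str) -> Ticket:
        # Steps 1-2: authenticate the principal, then return a ticket for the service.
        if principal not in self._principals:
            raise PermissionError(f"unknown principal: {principal}")
        return (principal, service, _sign(self._service_keys[service], principal, service))


class ToyService:
    def __init__(self, name: str, key: bytes) -> None:
        self.name, self._key = name, key

    def accept(self, ticket: Ticket) -> bool:
        # Step 3: the service checks the KDC's signature; it never sees a password.
        principal, service, sig = ticket
        expected = _sign(self._key, principal, service)
        return service == self.name and hmac.compare_digest(sig, expected)


key = b"shared-between-kdc-and-service"
kdc, nifi = ToyKDC(), ToyService("nifi", key)
kdc.register_service("nifi", key)
kdc.register_principal("alice")
ticket = kdc.issue_ticket("alice", "nifi")
accepted = nifi.accept(ticket)
forged = nifi.accept(("mallory", "nifi", "00" * 32))
```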

Authorization Model
▪ Authorization is delegated to a pluggable authorizer
▪ The authorizer authorizes each request based on user identity, action, and
resource
▪ Authorizer determines if the user can perform an action on the given resource
─ Example: user1 attempts to modify properties on processor1:
─ User Identity: user1
─ Action: WRITE
─ Resource: processor1 (uuid)
▪ If the authorizer reports that the resource is not found, the parent is checked;
if the parent is not found, the parent’s parent is checked, and so on
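
The lookup described above, including the walk up to parent resources, can be sketched as follows. This is a simplified model rather than NiFi's authorizer API: the component tree, the policy table, and the `authorize` function are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical component tree: child resource -> parent resource.
PARENTS: Dict[str, Optional[str]] = {
    "processor1": "group-a",
    "group-a": "root",
    "root": None,
}

# Hypothetical policies: (resource, action) -> identities allowed.
POLICIES: Dict[Tuple[str, str], Set[str]] = {
    ("root", "READ"): {"user1", "user2"},
    ("group-a", "WRITE"): {"user1"},
}


def authorize(identity: str, action: str, resource: str) -> bool:
    """Check the resource, then each ancestor, until a policy for the action is found."""
    current: Optional[str] = resource
    while current is not None:
        allowed = POLICIES.get((current, action))
        if allowed is not None:
            return identity in allowed
        current = PARENTS.get(current)
    return False  # no policy anywhere up the chain: deny


user1_write = authorize("user1", "WRITE", "processor1")
user2_write = authorize("user2", "WRITE", "processor1")
user2_read = authorize("user2", "READ", "processor1")
```

Here `processor1` has no policy of its own, so the WRITE decision comes from `group-a` and the READ decision comes from `root`.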

Initial Administrator User Identity
▪ When you set up a secured NiFi instance for the first time, you must manually
designate an “Initial Admin Identity”
▪ Admin user is granted access to the UI and given the ability to create
additional users, groups, and policies
 

[Figure: NiFi UI callouts. The Global Menu Policies option gives access to global policies; the Global Menu Users option gives access to Users/Groups; the lock icon gives access to policies for the currently selected components]

NiFi Registry Security
▪ NiFi Registry is secured in the same way as NiFi is secured
─ Configure NiFi Registry to run over HTTPS
─ Configure a keystore and truststore for NiFi Registry
─ Configure authentication method(s): TLS/SSL (always enabled), SPNEGO
(Simple and Protected GSSAPI Negotiation Mechanism), and login provider

NiFi Registry Security Policies

Local Policy              Ranger Policy          Allowed functions
Can manage policies       /policies              Read: can read all policies
                                                 Write: can create new policies
                                                 Delete: can delete policies
Can manage users          /tenants               Read: can view existing users
                                                 Write: can add new local users/groups
                                                 Delete: can delete local users/groups
Can proxy user requests   /proxy                 Can proxy user requests (all NiFi nodes must be granted this policy)
Can manage buckets        /buckets               Read: can read all buckets (all NiFi nodes must have read access)
                                                 Write: can create new buckets
                                                 Delete: can delete buckets
Bucket policy             /buckets/bucket UUID   Read: can import a flow to NiFi
                                                 Write: can place a new flow under version control
                                                 Delete: can delete a flow from a bucket

Steps to Secure NiFi
 

NiFi Cluster Security Architecture
 

NiFi Cluster and Site-to-Site Security Architecture
 

Essential Points
▪ NiFi security comprises the following elements
─ Administration - for central management and consistent security
─ Authentication - to confirm the identity of users and systems
─ Authorization - to determine whether a participant is allowed to perform an
action
─ Auditing - to maintain a record of data access
─ Data Protection - to prevent unauthorized access to data at rest and in
motion
▪ When you set up a secured NiFi instance for the first time, you must manually
designate an “Initial Admin Identity”
▪ NiFi Registry is secured in the same way as NiFi is secured

Conclusion
Chapter 20
Course Objectives (1)
During this course, you have learned
▪ About Cloudera Flow Management in the context of the Cloudera Dataflow
Data-in-Motion Platform
▪ How NiFi and MiNiFi fit into the Cloudera Edge to AI paradigm
▪ About the NiFi Architecture, including standalone and clustered configurations
▪ About the key features, concepts, and benefits of NiFi
▪ How FlowFiles, processors, process groups, controllers, and connections work
together to define a NiFi dataflow

Course Objectives (2)
▪ How to navigate, configure dataflows, and use dataflow information with the NiFi
User Interface
▪ How to trace the life of data (its origin, transformation, and destination) using data
provenance
▪ How to organize and simplify dataflows
▪ How to manage dataflow versions using the NiFi Registry
▪ How to use the NiFi Expression Language to control dataflows
▪ About dataflow optimization methods and available monitoring and reporting
features

Which Course to Take Next
▪ For developers
─ Cloudera Developer Training for Apache Spark and Hadoop
─ Cloudera Search Training
─ Cloudera Training for Apache HBase
▪ For system administrators
─ Cloudera Administrator Training for Apache Hadoop
─ Cloudera Security Training
─ Cloudera HDP Operations: Administration Foundations
▪ For data analysts and data scientists
─ Cloudera Data Analyst Training
─ Cloudera Data Scientist Training
