
Big Data Huawei Course

Flume
NOTICE
This document was generated from Huawei study material. Treat the
information in this document as supporting material.

Centro de Inovação EDGE - Big Data Course


Table of Contents

1. What is Flume
    1.1. Functions of Flume
    1.2. Position of Flume in FusionInsight HD
2. Architecture of Flume
    2.1. Source
    2.2. Channel
    2.3. Sink
3. Log Collection
4. Multi-level Cascading and Multi-Channel Duplication
5. Data Monitoring in FusionInsight HD
6. Transmission Reliability
    6.1. Failover
7. Data Filtering During Transmission



Flume - Huawei Course
1. What is Flume

• Flume is an open-source log system used for collecting, processing, and transferring data. It is a distributed, highly reliable, and highly available system for aggregating massive volumes of log data.
• Flume supports customized data transmitters for collecting data. It can also perform simple processing on the data before writing it to customizable data receivers. Flume is well suited to efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.
• The use of Flume is not restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including, but not limited to, network traffic data, social-media-generated data, email messages, and virtually any other data source.

1.1. Functions of Flume

• Collect logs from a specified directory and save them to HDFS, HBase, or Kafka (see the configuration sketch below).
• Flume can also collect logs and save them to a specified path in real time.
• Flume supports the cascading mode, in which multiple Flume nodes interwork with each other to aggregate data.
• In addition, users can customize data collection (collect customizable data).
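To illustrate the first function above, here is a minimal configuration sketch in the style of an Apache Flume properties file. The agent name (a1), directory path, and NameNode address are hypothetical placeholders; FusionInsight HD uses the same configuration model, but property names and defaults should be checked against the installed version.

    # Hypothetical agent "a1": read files dropped into a spooling directory
    # and write them to HDFS.
    a1.sources  = src1
    a1.channels = ch1
    a1.sinks    = sink1

    a1.sources.src1.type     = spooldir
    a1.sources.src1.spoolDir = /var/log/app/spool
    a1.sources.src1.channels = ch1

    a1.channels.ch1.type     = memory
    a1.channels.ch1.capacity = 10000

    a1.sinks.sink1.type          = hdfs
    a1.sinks.sink1.hdfs.path     = hdfs://namenode:8020/flume/logs
    a1.sinks.sink1.hdfs.fileType = DataStream
    a1.sinks.sink1.channel       = ch1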

1.2. Position of Flume in FusionInsight HD



• In FusionInsight HD, Flume sits in the data-collection layer of the platform. It is a distributed framework for collecting and aggregating data.
• It can be used to ingest data into FusionInsight HD.

2. Architecture of Flume

• A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes.
• A Flume Agent is a process that hosts the components through which events flow from an external source to the next destination.

o A Flume source consumes events delivered to it by an external source, like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or from other Flume agents in the flow that send events through an Avro sink. A similar flow can be defined using a Thrift Flume source to receive events from a Thrift sink, a Flume Thrift RPC client, or Thrift clients written in any language generated from the Flume Thrift protocol.

o When a Flume source receives an event, it stores it into one or more channels.

o The channel is a passive store that keeps the event until it is consumed by a Flume sink.

o The sink removes the event from the channel and puts it into an external repository like HDFS, or forwards it to the Flume source of the next Flume Agent in the flow.



o The source and sink within a given Agent run asynchronously (in parallel) with the events staged in the channel.

• Flume allows users to build multi-agent flows in which events travel through multiple Agents before reaching the final destination.

• Sources are the entry points of data into the agent.

• Flume models and abstracts raw data as a data object that can be processed by the framework, called an Event.
• The Channel Processor saves the data passed from the Source into the Channel.
• An Interceptor filters and modifies the collected data based on the user configuration.
• The Channel Selector saves data to different Channels based on the user configuration.
• The main function of a Channel is simply to temporarily buffer the data.
• The Sink Runner drives the Sink Processor, and the Sink Processor drives the Sink to obtain data from the Channel using different policies based on the user configuration.
• These policies include load balancing, failover, and straight-through (the default). The configuration skeleton after this list shows where each of these components is declared.
• The main function of a Sink is to obtain data from the Channel and save it to different destinations.
• An Event is the minimum unit of a data stream. It is transmitted from the external data source to the destination.
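To make the mapping between these logical components and an actual agent definition concrete, the following is a hedged skeleton in the style of an Apache Flume properties file. All names (agent a1, source r1, channel c1, sinks k1/k2, group g1) and the specific types used are hypothetical placeholders.

    # Hypothetical skeleton showing where each architectural component lives.
    a1.sources    = r1
    a1.channels   = c1
    a1.sinks      = k1 k2
    a1.sinkgroups = g1

    # Source: receives events and hands them to the channel processor.
    a1.sources.r1.type = avro
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 4141

    # Interceptor: attached to the source; filters or modifies events.
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = timestamp

    # Channel selector: decides which channel(s) receive each event.
    a1.sources.r1.channels = c1
    a1.sources.r1.selector.type = replicating

    # Channel: passive buffer between the source and the sinks.
    a1.channels.c1.type = memory

    # Sinks: drain events from the channel toward the next hop or store.
    a1.sinks.k1.type    = logger
    a1.sinks.k1.channel = c1
    a1.sinks.k2.type    = logger
    a1.sinks.k2.channel = c1

    # Sink processor: applies a policy (straight-through, failover, or
    # load_balance) when the sink runner pulls events for the grouped sinks.
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = load_balance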

2.1. Source

• A Source receives data, or generates data through a special mechanism, and places the data in batches into one or more Channels. There are two types of sources: data-driven (or event-driven) and polling.



o Data-driven (event-driven) - the external source actively sends data to Flume, driving Flume to accept the data.

o Polling - Flume periodically obtains data in an active manner.

• Typical Source types

o Syslog is a source that integrates with the system log facility. It reads syslog data and generates Flume events. There are two types of syslog source:

- the UDP source treats an entire message as a single event;

- the TCP source creates a new event for each string of characters separated by a newline.

o Exec is a source that automatically generates events. It runs a given command or script and takes the execution result as the data source (see the sketch after this list). The Exec source runs a given Unix command on startup and expects that process to continuously produce data. If the process exits for any reason, the source also exits and produces no further data.

o Avro is an RPC source that is used in the communication between agents. When paired with the built-in Avro sink of the previous Flume agent, it can create tiered collection topologies.

• Remember that a source must be associated with at least one channel.
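A minimal, hedged sketch of an Exec source follows, assuming Apache-Flume-style properties; the agent name and log path are hypothetical, and the logger sink is used only so that the fragment is self-contained and testable.

    # Hypothetical agent: an Exec source tails a log file; a logger sink
    # prints the events (useful for testing only).
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    a1.sources.r1.type     = exec
    a1.sources.r1.command  = tail -F /var/log/app/app.log
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory

    a1.sinks.k1.type    = logger
    a1.sinks.k1.channel = c1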

2.2. Channel



• The Channel is located between the Source and the Sink. The Channel functions much like a queue: it temporarily stores events. When the Sink successfully sends events to the next-hop Channel or to the destination, the events are removed from the current Channel.
• Different channels provide different persistence levels:

o The Memory channel does not support persistence. The events are stored in an in-memory queue with a configurable maximum size. It is ideal for flows that need high throughput and that are prepared to lose the staged data in the event of an agent failure.

o The persistence of the File channel is based on a WAL (Write-Ahead Log). The File channel persists data to disk, in the local file system.

o By default, the File Channel uses paths inside the user's home directory for both the checkpoint and data directories. As a result, if more than one File Channel instance is active within the same Agent, only one will be able to lock those directories, causing the initialization of the other Channels to fail. It is therefore necessary to provide explicit paths for all configured File Channels, preferably on different disks (see the sketch after this list).

o In the JDBC channel, events are stored in a persistent store backed by a database. The JDBC channel currently supports the embedded Derby database. This is a durable channel that is ideal for flows where recoverability is important.

• Channels support transactions. When Flume transmits data to the next process over the channels, if something goes wrong, the data is rolled back and kept in the channels, waiting for the next attempt. Channels also provide a weak ordering guarantee and can work with any number of Sources and Sinks.
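As a hedged illustration of the explicit-path recommendation above, the fragment below declares a File Channel and a Memory Channel in Apache-Flume-style properties; the agent name and directory paths are hypothetical.

    # Hypothetical channel definitions for agent "a1".
    # File Channel: durable, WAL-based; give each instance its own
    # checkpoint and data directories, preferably on different disks.
    a1.channels.c1.type          = file
    a1.channels.c1.checkpointDir = /data1/flume/channel1/checkpoint
    a1.channels.c1.dataDirs      = /data2/flume/channel1/data

    # Memory Channel: fast, but staged events are lost if the agent fails.
    a1.channels.c2.type     = memory
    a1.channels.c2.capacity = 10000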

2.3. Sink

• The Sink is responsible for sending data to the next hop or to the final destination and for removing the data from the channel after sending it successfully.
• A sink must function with a specific channel.
• Typical Sink types:

o Sinks that store data at the final destination, such as HDFS and HBase.



o The HDFS sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text files and SequenceFiles, and supports compression in both file types. Files can be rolled periodically based on the elapsed time, the size of the data, or the number of events. The sink can also bucket or partition data by attributes such as the timestamp or the machine where the event originated. The HDFS directory path may contain formatting escape sequences that are replaced by the HDFS sink to generate the directory or file name used to store the events (see the sketch after this list). Using this sink requires Hadoop to be installed so that Flume can use the Hadoop JARs (Java Archive files) to communicate with the HDFS cluster.

o The HBase sink writes data to HBase. This sink provides the same consistency guarantees as HBase, which is currently row-wise atomicity. If HBase fails to write certain events, the sink will replay all events in that transaction. In addition, the HBase sink supports writing data to secure HBase; to write to secure HBase, the user that the agent runs as must have write permission to the table that the sink is configured to write to.

o The Avro sink is an RPC sink that is used for communication between agents. This sink forms one half of Flume's tiered collection support. Flume events sent to this sink are turned into Avro events and sent to the configured hostname/port pair. The events are taken from the configured Channel in batches of the configured batch size.
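The following is a hedged sketch of an HDFS sink illustrating the escape sequences and rolling parameters described above; the sink name, NameNode address, and paths are hypothetical, and exact defaults should be checked against the installed Flume version.

    # Hypothetical HDFS sink for agent "a1". The %Y/%m/%d/%H escape
    # sequences are expanded from the event timestamp to build the path.
    a1.sinks.k1.type    = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path       = hdfs://namenode:8020/flume/events/%Y/%m/%d/%H
    a1.sinks.k1.hdfs.filePrefix = events
    # DataStream writes plain text; SequenceFile and CompressedStream are also supported.
    a1.sinks.k1.hdfs.fileType = DataStream
    # Roll files every 300 seconds or at about 128 MB; disable rolling by event count.
    a1.sinks.k1.hdfs.rollInterval = 300
    a1.sinks.k1.hdfs.rollSize     = 134217728
    a1.sinks.k1.hdfs.rollCount    = 0
    # Timestamp-based escapes require a timestamp header on the event
    # (e.g. the timestamp interceptor) or hdfs.useLocalTimeStamp = true.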

3. Log Collection

• Flume can collect logs from outside a cluster and save them in HDFS, HBase, or Kafka for data cleaning and analysis by upper-layer applications.
• Flume can be configured with multiple sources, channels, and sinks. The example in the course figure shows three different sources feeding three different channels, which finally sink to different destinations such as HDFS, HBase, and Kafka; a condensed configuration sketch follows.
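The following is a condensed, hedged sketch of such a multi-flow agent; all names, paths, and addresses are hypothetical, sink-specific properties are abbreviated, and the Kafka sink properties shown follow recent Apache Flume releases (older releases use different property names).

    # Hypothetical agent with three independent flows:
    # src1 -> ch1 -> HDFS, src2 -> ch2 -> HBase, src3 -> ch3 -> Kafka.
    a1.sources  = src1 src2 src3
    a1.channels = ch1 ch2 ch3
    a1.sinks    = hdfsSink hbaseSink kafkaSink

    a1.sources.src1.type     = spooldir
    a1.sources.src1.spoolDir = /var/log/app1
    a1.sources.src1.channels = ch1

    a1.sources.src2.type     = exec
    a1.sources.src2.command  = tail -F /var/log/app2/app2.log
    a1.sources.src2.channels = ch2

    a1.sources.src3.type     = syslogtcp
    a1.sources.src3.host     = 0.0.0.0
    a1.sources.src3.port     = 5140
    a1.sources.src3.channels = ch3

    # Memory channels for brevity; for durability use File channels with
    # explicit directories as described in section 2.2.
    a1.channels.ch1.type = memory
    a1.channels.ch2.type = memory
    a1.channels.ch3.type = memory

    a1.sinks.hdfsSink.type      = hdfs
    a1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/app1
    a1.sinks.hdfsSink.channel   = ch1

    a1.sinks.hbaseSink.type         = hbase
    a1.sinks.hbaseSink.table        = app2_logs
    a1.sinks.hbaseSink.columnFamily = cf
    a1.sinks.hbaseSink.channel      = ch2

    a1.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
    a1.sinks.kafkaSink.kafka.bootstrap.servers = broker1:9092
    a1.sinks.kafkaSink.kafka.topic = app3_logs
    a1.sinks.kafkaSink.channel = ch3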



4. Multi-level Cascading and Multi-Channel Duplication

• Flume also supports cascading multiple Flume agents, as shown in the course figure.
• For data to flow across multiple agents, the Sink of the previous agent and the Source of the current agent need to be of the Avro type, with the Sink pointing to the host name (or IP address) and port of the Source (see the sketch below).
• The figure also shows an example of Channel duplication, where a Source sends events to two different Channels. In this way, events are duplicated and saved in different repositories.

• Data transmitted between cascaded Flume nodes (or agents) can be compressed or encrypted, which improves data transmission efficiency and security. Note that encryption of the stored data depends on the storage medium and the Sink type.

o There is no need to encrypt data exchanged between the Source, Channel, and Sink, since that exchange takes place within the same process.
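A hedged sketch of the cascading hop between two agents, using Apache-Flume-style properties with hypothetical agent names, host, and port; the compression setting must match on both sides of the hop.

    # Agent "upstream": its Avro sink points at the downstream agent.
    upstream.sinks.k1.type             = avro
    upstream.sinks.k1.hostname         = collector01.example.com
    upstream.sinks.k1.port             = 4545
    upstream.sinks.k1.compression-type = deflate
    upstream.sinks.k1.channel          = c1

    # Agent "downstream": its Avro source listens on the same port and
    # must use the same compression setting.
    downstream.sources.r1.type             = avro
    downstream.sources.r1.bind             = 0.0.0.0
    downstream.sources.r1.port             = 4545
    downstream.sources.r1.compression-type = deflate
    downstream.sources.r1.channels         = c1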

5. Data Monitoring in FusionInsight HD

• Flume supports the presentation of monitoring indicators on FusionInsight Manager, including the volume of data received by the Source, the volume of data buffered in the Channel, and the volume of data written by the Sink.
• Flume also supports alarms for channel buffering failures, data transmission failures, and data receiving failures.

6. Transmission Reliability

• Events are staged in a Channel on each agent.

• The events are then delivered to the next agent or to a terminal repository (such as HDFS) in the flow.



• The events are removed from a Channel only after they are stored in the Channel of the next agent or in the terminal (final) repository. This single-hop message delivery semantics is how Flume provides end-to-end reliability of the flow.
• Flume uses a transactional approach to guarantee the reliable delivery of the events. The Sources and Sinks encapsulate the events in a transaction, which makes sure that the set of events is reliably passed from point to point in the flow.
• In a multi-agent flow, the Sink of the previous agent and the Source of the next agent both have their transactions running to ensure that the data is safely stored in the Channel of the next agent.
• When data flows from one agent to another, two transactions take effect. The Sink of agent 1 obtains a message from its Channel and sends it to agent 2. Agent 2 receives the message and starts a new transaction; once the data is processed successfully, that is, written to a Channel successfully, agent 2 commits its transaction and sends a success response to agent 1. Agent 1 then commits its own transaction, which indicates a successful and reliable data transmission. If the transmission fails before a commit, the last transaction is started again and the data that failed to be transmitted last time is retransmitted. Because a commit writes the transaction to disk, the last transaction can resume after the process fails and recovers.

6.1. Failover

• During data transmission, if the next-hop Flume node is faulty or receives data abnormally, the data is automatically switched over to another path (see the failover sink group sketch below).
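One common way to obtain this behavior with the open-source Flume configuration model is a failover sink group, sketched below with hypothetical names and hosts; FusionInsight HD may expose the same behavior through its own configuration tooling.

    # Hypothetical failover group: k1 (higher priority) is preferred;
    # if it fails, events are routed to k2 until k1 recovers.
    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = failover
    a1.sinkgroups.g1.processor.priority.k1 = 10
    a1.sinkgroups.g1.processor.priority.k2 = 5
    a1.sinkgroups.g1.processor.maxpenalty = 10000

    a1.sinks.k1.type     = avro
    a1.sinks.k1.hostname = collector01.example.com
    a1.sinks.k1.port     = 4545
    a1.sinks.k1.channel  = c1

    a1.sinks.k2.type     = avro
    a1.sinks.k2.hostname = collector02.example.com
    a1.sinks.k2.port     = 4545
    a1.sinks.k2.channel  = c1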



7. Data Filtering During Transmission

• Flume also has the capability to roughly filter, clean, or even drop unnecessary data during data transmission.
• This is done with the help of interceptors. An interceptor can modify or even drop events based on any criteria chosen by the developer of the interceptor.

o Flume supports chaining of interceptors. This is done by specifying a list of interceptor names in the configuration file. Interceptors are specified as a whitespace-separated list in the source configuration, and the order in which the interceptors are specified is the order in which they are invoked. The list of events returned by one interceptor is passed to the next interceptor in the chain. Interceptors can modify or drop events: if an interceptor needs to drop events, it simply does not return those events in the list that it returns; if it is to drop all events, it returns an empty list.

• A Channel Selector can also be used to filter data. It can route data to different channels based on the events, which provides a routing function (see the sketch after this list).
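A hedged sketch of both mechanisms, using built-in Apache Flume interceptor and selector types; the agent name, header name, and regular expression are hypothetical.

    # Hypothetical source with a chained timestamp interceptor and a
    # regex filter that drops events whose body starts with "DEBUG".
    a1.sources.r1.interceptors = i1 i2
    a1.sources.r1.interceptors.i1.type = timestamp
    a1.sources.r1.interceptors.i2.type = regex_filter
    a1.sources.r1.interceptors.i2.regex = ^DEBUG
    a1.sources.r1.interceptors.i2.excludeEvents = true

    # Multiplexing channel selector: route events by the value of the
    # "datacenter" header to different channels.
    a1.sources.r1.channels = c1 c2
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = datacenter
    a1.sources.r1.selector.mapping.dc1 = c1
    a1.sources.r1.selector.mapping.dc2 = c2
    a1.sources.r1.selector.default = c1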



