Flume
NOTICE
This document was generated from Huawei study material. Treat the
information in this document as supporting material.
• Flume is an open-source log system used for collecting, processing, and transferring
data. It is a distributed, highly reliable, and highly available massive log aggregation
system.
• Flume supports customized data transmitters for collecting data. Flume also roughly
processes data and writes it to customizable data receivers. Flume is applicable to
scenarios that require efficiently collecting, aggregating, and moving large amounts of
log data from many different sources to a centralized data store.
• The use of Flume is not restricted to log data aggregation. Since data sources are
customizable, Flume can be used to transport massive quantities of event data,
including, but not limited to, network traffic data, social-media-generated data, email
messages, and nearly any other possible data source.
• Flume can collect logs from a specified directory and save them to HDFS, HBase, or Kafka.
• Flume can also collect logs and save them to a specified path in real time.
• Flume supports the cascading mode, in which multiple Flume nodes interwork
with each other to aggregate data.
• In addition, users can customize data collection (collect customizable data).
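The directory-to-HDFS case above can be sketched as a minimal agent configuration in Flume's standard properties-file format (the agent name a1 and all paths below are hypothetical):

```properties
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Spooling-directory source: picks up files dropped into a local directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/app
a1.sources.r1.channels = c1

# In-memory channel buffering events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# HDFS sink: writes the collected logs into date-partitioned HDFS paths
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

The `useLocalTimeStamp` setting is needed here because the date escapes in `hdfs.path` otherwise require a timestamp header on each event, which the spooling-directory source does not add by default.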
2. Architecture of Flume
• A Flume event is defined as a unit of data flow having a byte payload and an optional
set of string attributes.
• A Flume agent is a process that hosts the components through which events flow from
an external source to the next destination.
o When a Flume source receives an event, it stores it into one or more channels.
o The channel is a passive store that keeps the event until it is consumed by a
Flume sink.
o The sink removes the event from the channel and puts it into an external repository
such as HDFS, or forwards it to the Flume source of the next Flume agent in the
flow.
• Flume allows users to build multi-agent flows in which events travel through multiple
agents before reaching the final destination.
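Such a multi-agent flow is wired by pairing the Avro sink of one agent with the Avro source of the next. A sketch with hypothetical agent names, hostnames, and port:

```properties
# Agent "collector1" on host hostA: its Avro sink points at the next tier
collector1.sinks = k1
collector1.sinks.k1.type = avro
collector1.sinks.k1.hostname = hostB
collector1.sinks.k1.port = 4545
collector1.sinks.k1.channel = c1

# Agent "aggregator" on host hostB: its Avro source listens on that port
aggregator.sources = r1
aggregator.sources.r1.type = avro
aggregator.sources.r1.bind = 0.0.0.0
aggregator.sources.r1.port = 4545
aggregator.sources.r1.channels = c1
```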
2.1. Source
• A source receives data, generates events by a specific mechanism, and places them
in batches in one or more channels. There are two types of sources: data-driven (or
event-driven) and polling.
o Syslog is a source that integrates with the system. It reads syslog data and
generates Flume events. There are two types of syslog source: a UDP source,
which treats an entire message as a single event, and a TCP source, which
creates a new event for each string of characters separated by a newline.
o Avro is an RPC source that is used for communication between agents.
When paired with the built-in Avro sink on a previous Flume agent, it can create
tiered collection topologies.
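A syslog TCP source can be sketched as follows (agent and component names are hypothetical; 5140 is just an example unprivileged port):

```properties
# Syslog TCP source: each newline-separated syslog message becomes one event
a1.sources = r1
a1.sources.r1.type = syslogtcp
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
```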
2.2. Channel
o The memory channel does not support persistency. Events are stored in an in-memory
queue with a configurable maximum size. It is ideal for flows that need high
throughput and are prepared to lose the staged data in the event of an agent
failure.
o The persistency of the file channel is based on a WAL (Write-Ahead Log). The file
channel supports persisting data to disk in the local filesystem.
o By default, the file channel uses a path within the user's home folder for both its
checkpoint and data directories. As a result, if more than one file channel
instance is active within the agent, only one will be able to lock the
directories, causing the other channel initializations to fail. It is therefore
necessary to provide explicit paths for all configured channels, preferably
on different disks.
o As for the JDBC channel, events are stored in persistent storage backed by a
database. The JDBC channel currently supports an embedded database
called Derby. This is a durable channel that is ideal for flows where
recoverability is important.
• Channels support transactions. When Flume transmits data to the next process over the
channels, if something goes wrong, the data is rolled back and kept in the channels,
waiting for the next processing attempt. Channels also provide weak ordering guarantees
and can work with any number of sources and sinks.
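The point about explicit file channel paths can be illustrated as follows (the directory locations are hypothetical); each file channel gets its own checkpoint and data directories, ideally on separate disks:

```properties
# Two file channels in one agent: each must lock its own directories
a1.channels = c1 c2

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data1/flume/checkpoint
a1.channels.c1.dataDirs = /data1/flume/data

a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /data2/flume/checkpoint
a1.channels.c2.dataDirs = /data2/flume/data
```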
2.3. Sink
• The sink is responsible for sending data to the next hop or the final destination and
removing the data from the channel after successfully sending it.
• A sink must be associated with a specific channel.
� Typical Sink types:
o Sinks that send data to the final storage destination, such as HDFS and HBase.
o The HBase sink writes data to HBase. This sink provides the same consistency
guarantees as HBase, which is currently row-wise atomicity. If HBase fails to
write certain events, the sink will replay all events in that transaction.
In addition, the HBase sink supports writing data to secure HBase. To
write to secure HBase, the user that the agent is running as must have write
permissions to the table that the sink is configured to write to.
o The Avro sink is an RPC sink that is used for communication between agents. This
sink forms one half of Flume's tiered collection support. Flume events sent to
this sink are turned into Avro events and sent to the configured hostname/
port pair. Events are taken from the configured channel in batches of the
configured batch size.
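An HBase sink can be sketched as follows (the table name, column family, and channel name are hypothetical; the serializer shown is one of Flume's built-in HBase event serializers):

```properties
# HBase sink: writes each event into a row of the configured table
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = flume_logs
a1.sinks.k1.columnFamily = cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1
```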
3. Log Collection
• Flume can collect logs from outside a cluster and save them in HDFS, HBase, or Kafka
for data cleaning and analysis by upper-layer applications.
• Flume can be configured with multiple sources, channels, and sinks. This example shows
three different sources feeding three different channels and finally sinking to different
places, such as HDFS, HBase, and Kafka.
• Flume also supports cascading of multiple Flume agents, as shown in the figure.
• For data to flow across multiple agents, the sink of the previous agent and the source
of the current agent need to be of the Avro type, with the sink pointing to the hostname
or IP address and port of the source.
• The image also shows an example of channel duplication, where a source sends events
to two different channels. In this way, events are duplicated and saved in different
repositories.
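The channel duplication described above corresponds to Flume's replicating channel selector, which is the default when a source feeds several channels. A sketch with hypothetical component names:

```properties
# One source fans out to two channels; the replicating selector
# (the default) duplicates every event into both channels
a1.sources = r1
a1.channels = c1 c2
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating
```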
• Data transmitted between cascaded Flume nodes (or agents) can be compressed or
encrypted, improving data transmission efficiency and security. Note that the
encryption depends on the storage medium and the sink type.
o There is no need to encrypt data between the source, channel, and sink, since
data exchange occurs within the same process.
• Flume supports presenting monitoring indicators on FusionInsight Manager, including
the received data size of the source, the data buffer size of the channel, and the
written data size of the sink.
• Flume also supports alarms for channel buffering failures, data transmission failures,
and data receiving failures.
6. Transmission Reliability
6.1. Failover
• During data transmission, if the next-hop Flume node is faulty or receives data
abnormally, the data is automatically switched over to another path.
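This switchover behavior is configured in Flume with a sink group and a failover sink processor; a sketch with hypothetical sink names and priorities:

```properties
# Sink group with a failover processor: k2 takes over if k1 fails
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000
```

The sink with the higher priority is used as long as it is healthy; a failed sink is penalized for a cooldown period (bounded by `maxpenalty`, in milliseconds) before being retried.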
• Flume also has the capability to roughly filter, clean, or even drop unnecessary data
during data transmission.
• This is done with the help of interceptors. An interceptor can modify or even drop
events based on any criteria chosen by the developer of the interceptor.
• A channel selector can also be used to filter data. It can transmit data to different
channels based on event attributes, which provides a routing function.
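The routing function above corresponds to Flume's multiplexing channel selector, which routes each event by the value of a header (the header name "type" and the mapped values below are hypothetical):

```properties
# Multiplexing selector: route events to channels by a header value
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.audit = c1
a1.sources.r1.selector.mapping.metrics = c2
a1.sources.r1.selector.default = c1
```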