What is Apache Flume?
Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data. In
simple terms, Flume continuously pulls data from different sources: it
ingests unstructured and streaming data and dumps it into a centralized
store. It has a simple and flexible architecture based on streaming data
flows. It is robust and fault tolerant, with tunable reliability
mechanisms and many failover and recovery mechanisms. It uses a simple,
extensible data model that allows for online analytic applications.
Flume supports the following mechanisms for reading data from popular
log stream types:
1. Avro
2. Thrift
3. Syslog
4. Netcat
Components of Flume:
The basic working of Flume is built on the following components:
1. Agent
2. Event
3. Source
4. Channel
5. Sink
1. Agent
An agent is an independent daemon process (JVM) in Flume. It receives
data (events) from clients or other agents and forwards it to its next
destination (a sink or another agent). A Flume deployment may have more
than one agent. The following diagram represents a Flume agent.
2. Event
An event is the basic unit of data transported inside Flume. It consists
of a payload of bytes and an optional set of string headers attached as
metadata.
3. Source
A source is the component of an Agent which receives data from the
data generators and transfers it to one or more channels in the form of
Flume events. Apache Flume supports several types of sources and each
source receives events from a specified data generator.
Example − Avro source, Thrift source, Twitter 1% source, etc.
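As an illustration, an Avro source can be declared in an agent's configuration file (a minimal sketch in standard Flume configuration syntax; the agent name a1, source name r1, and the bind address and port are assumptions for the example):

```properties
# Declare an Avro source named r1 on agent a1
a1.sources = r1
a1.sources.r1.type = avro
# Host and port on which the source listens for Avro clients
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
```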
4. Channel
A channel is a transient store which receives the events from the source
and buffers them till they are consumed by sinks. It acts as a bridge
between the sources and the sinks. These channels are fully transactional
and they can work with any number of sources and sinks.
Example − JDBC channel, File channel, Memory channel, etc.
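The memory channel, for instance, is configured with a capacity that bounds how many events it can buffer (a minimal sketch; the agent and channel names and the capacity values are assumptions for the example):

```properties
# Buffer events in memory on agent a1, channel c1
a1.channels = c1
a1.channels.c1.type = memory
# Maximum number of events the channel can hold
a1.channels.c1.capacity = 1000
# Maximum number of events per transaction with a source or sink
a1.channels.c1.transactionCapacity = 100
```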
5. Sink
A sink stores the data into centralized stores like HBase and HDFS. It
consumes the data (events) from the channels and delivers it to the
destination. The destination of the sink might be another agent or a
central store.
Example − HDFS sink
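An HDFS sink, for example, is pointed at a target directory in HDFS (a minimal sketch; the agent and sink names and the HDFS path are assumptions for the example):

```properties
# Deliver events from channel c1 to HDFS
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Hypothetical target directory for the written files
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
# Write plain text rather than the default SequenceFile format
a1.sinks.k1.hdfs.fileType = DataStream
```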
Working of Flume:
A Flume configuration file names and describes the following pieces:
1. agent name
2. source name
3. channel name
4. sink name
5. source type
6. sink type
Explanation:
The diagram below depicts the steps for writing the configuration file
and running a Flume agent to pull data and put it into the centralized
store or directory where we want to keep the pulled data, so that
further operations can be performed on it.
This configuration defines a single agent named a1. a1 has a source that
listens for data on port 44440, a channel that buffers event data in
memory, and a sink that logs event data to the console. The configuration
file names the various components, then describes their types and
configuration parameters. A given configuration file might define several
named agents; when a given Flume process is launched a flag is passed
telling it which named agent to manifest.
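The configuration just described can be written out as follows (a minimal sketch in standard Flume properties syntax; the netcat source type and the component names r1, c1, and k1 are assumptions for the example, while the agent name a1 and port 44440 come from the description above):

```properties
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-delimited text on port 44440
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44440
a1.sources.r1.channels = c1

# Channel: buffer events in memory
a1.channels.c1.type = memory

# Sink: log each event to the console
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

With this file saved as, say, `example.conf`, the agent can be launched with `flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console`; the `--name a1` flag selects which named agent in the file to run.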
Other Features supported by Flume:
1. Failure Handling
In Flume, for each event, two transactions take place: one at the sender
and one at the receiver. The sender sends events to the receiver. Soon
after receiving the data, the receiver commits its own transaction and
sends a “received” signal to the sender. After receiving the signal, the
sender commits its transaction. (Sender will not commit its transaction
till it receives a signal from the receiver.)
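The handoff described above can be sketched in Python (an illustrative simulation only, not Flume code; the class and method names are invented for the example):

```python
class Receiver:
    """Simulates the receiving agent: commits locally, then acknowledges."""

    def __init__(self):
        self.committed = []

    def accept(self, event):
        # The receiver commits its own transaction first...
        self.committed.append(event)
        # ...then sends a "received" signal back to the sender.
        return "received"


class Sender:
    """Simulates the sending agent: commits only after the receiver's signal."""

    def __init__(self, receiver):
        self.receiver = receiver
        self.committed = []

    def send(self, event):
        signal = self.receiver.accept(event)
        # The sender does not commit its transaction until it sees the signal.
        if signal == "received":
            self.committed.append(event)
            return True
        return False


receiver = Receiver()
sender = Sender(receiver)
ok = sender.send("log line 1")
# Once the handshake completes, both sides have committed the event.
print(ok, receiver.committed, sender.committed)
```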
Generally events and log data are generated by the log servers and
these servers have Flume agents running on them. These agents receive
the data from the data generators.
The data in these agents will be collected by an intermediate node known
as Collector. Just like agents, there can be multiple collectors in Flume.
Finally, the data from all these collectors will be aggregated and pushed
to a centralized store such as HBase or HDFS. The following diagram
explains the data flow in Flume.
2. Reliability
In Apache Flume, sources transfer events through the channel: the
source puts events into the channel, and they are then consumed by the
sink. The sink transfers each event to the next agent or to the terminal
repository (such as HDFS).
An event is removed from a channel only once it has been stored in the
next agent's channel or in the terminal repository.
In this way, the single-hop message delivery semantics in Apache Flume
provide end-to-end reliability of the flow. Flume uses a transactional
approach to guarantee reliable delivery of events.
3. Recoverability
The flume events are staged in a flume channel on each flume agent.
This manages recovery from failure. Also, Apache Flume supports a
durable File channel. File channels can be backed by the local file system.
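A durable file channel is configured by pointing it at directories on local disk (a minimal sketch; the agent and channel names and the directory paths are assumptions for the example):

```properties
# Durable channel backed by the local file system on agent a1
a1.channels = c1
a1.channels.c1.type = file
# Hypothetical local directories for checkpoints and event data
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
```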
Comparison