
What is Apache Flume?
Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data. In simple
words, Flume continuously pulls data from different sources: it ingests
unstructured and streaming data and dumps it into a centralised store. It has
a simple and flexible architecture based on streaming data flows. It is robust
and fault tolerant, with tuneable reliability mechanisms and many failover and
recovery mechanisms. It uses a simple, extensible data model that allows for
online analytic applications. Flume supports mechanisms for reading data from
popular log stream sources, such as:
1. Avro
2. Thrift
3. Syslog
4. Netcat

Components of Flume:
The basic working of Flume is built around the following components:

1. Agent
2. Event
3. Source
4. Channel
5. Sink

1. Agent
An agent is an independent daemon process (JVM) in Flume. It receives
data (events) from clients or other agents and forwards it to its next
destination (a sink or another agent). A Flume deployment may have more than
one agent. A Flume agent contains three main components, namely a source, a
channel, and a sink.
2. Event
An event is the basic unit of data transported inside Flume. It contains a
byte-array payload that is to be transported from the source to the
destination, accompanied by optional headers. A typical Flume event has the
following structure:
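A rough textual sketch of that structure (the header keys shown, such as timestamp and host, are illustrative examples rather than fields required by Flume):

    +---------------------------------------+-----------------------+
    | Headers (optional key/value strings)  | Body (byte[] payload) |
    | e.g. timestamp=1461234567, host=web01 | raw log record bytes  |
    +---------------------------------------+-----------------------+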

3. Source
A source is the component of an Agent which receives data from the
data generators and transfers it to one or more channels in the form of
Flume events. Apache Flume supports several types of sources and each
source receives events from a specified data generator.
Examples − Avro source, Thrift source, Twitter 1% source, etc.
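As a sketch, an Avro source could be declared in an agent's properties file roughly like this (the agent name a1, the component names r1 and c1, and the port are placeholder assumptions):

    # declare the source and wire it to a channel
    a1.sources = r1
    a1.sources.r1.type = avro
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 4141
    a1.sources.r1.channels = c1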

4. Channel
A channel is a transient store which receives the events from the source
and buffers them till they are consumed by sinks. It acts as a bridge
between the sources and the sinks. These channels are fully transactional
and they can work with any number of sources and sinks.
Example − JDBC channel, File system channel, Memory channel, etc.
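For instance, a memory channel might be configured roughly as follows (the channel name c1 and the capacity values are illustrative assumptions):

    # buffer up to 10000 events in memory, 1000 per transaction
    a1.channels = c1
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000
    a1.channels.c1.transactionCapacity = 1000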
5. Sink
A sink stores the data into centralized stores like HBase and HDFS. It
consumes the data (events) from the channels and delivers it to the
destination. The destination of the sink might be another agent or the
central stores.
Example − HDFS sink
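A minimal HDFS sink sketch (the sink name k1 and the HDFS path are assumptions added for illustration):

    # write events from channel c1 to HDFS as plain data files
    a1.sinks = k1
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.channel = c1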

Working of Flume:

Flume works on the concept of events and agents. A Flume event is defined as a
unit of data flow having a byte payload and an optional set of string
attributes. A Flume agent is a (JVM) process that hosts the components through
which events flow from an external source to the next destination (hop).
A Flume source consumes events delivered to it by an external source like
a web server. The external source sends events to Flume in a format that
is recognized by the target Flume source. For example, an Avro Flume
source can be used to receive Avro events from Avro clients or other
Flume agents in the flow that send events from an Avro sink. A similar
flow can be defined using a Thrift Flume Source to receive events from a
Thrift Sink or a Flume Thrift Rpc Client or Thrift clients written in any
language generated from the Flume thrift protocol. When a Flume
source receives an event, it stores it into one or more channels.
The channel is a passive store that keeps the event until it’s consumed by
a Flume sink. The file channel is one example – it is backed by the local
filesystem. The sink removes the event from the channel and puts it into
an external repository like HDFS (via Flume HDFS sink) or forwards it to
the Flume source of the next Flume agent (next hop) in the flow. The
source and sink within the given agent run asynchronously with the
events staged in the channel.
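As a sketch of such a multi-agent hop (all agent names, host names, and ports below are assumptions, and the channel definitions are omitted for brevity), the first agent's Avro sink simply points at the host and port on which the next agent's Avro source listens:

    # agent "web": forwards its events over Avro to the collector
    web.sinks = avroOut
    web.sinks.avroOut.type = avro
    web.sinks.avroOut.hostname = collector.example.com
    web.sinks.avroOut.port = 4545
    web.sinks.avroOut.channel = webChannel

    # agent "collector": receives those events on its Avro source
    collector.sources = avroIn
    collector.sources.avroIn.type = avro
    collector.sources.avroIn.bind = 0.0.0.0
    collector.sources.avroIn.port = 4545
    collector.sources.avroIn.channels = collectorChannel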

The basic architecture of Flume is as follows: data generators (such as
Facebook or Twitter) generate data which gets collected by individual Flume
agents running on them. Thereafter, a data collector (which is also an agent)
collects the data from those agents, aggregates it, and pushes it into a
centralized store such as HDFS or HBase.
NOTE: We could instead use HDFS's put command to transfer files, but:
- Using the put command, we can transfer only one file at a time, while the
data generators produce data at a much higher rate. Since analysis of older
data is less accurate, we need a solution that transfers data in real time.
- With the put command, the data has to be packaged and ready for upload.
Since web servers generate data continuously, preparing it this way is a very
difficult task.
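For reference, the put command mentioned here is the HDFS shell command for copying a local file into HDFS; a typical invocation (the paths are illustrative) looks like:

    hadoop fs -put /var/log/webserver/access.log /user/hadoop/logs/

Each such invocation copies one already-complete local file, which is exactly the limitation described above.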

Set Up: Configuration


Flume agent configuration is stored in a local configuration file. This is a
text file that follows the Java properties file format. Configurations for
one or more agents can be specified in the same configuration file. The
configuration file includes properties of each source, sink and channel in
an agent and how they are wired together to form data flows.
The agent needs to know what individual components to load and how
they are connected in order to constitute the flow. This is done by listing
the names of each of the sources, sinks and channels in the agent, and
then specifying the connecting channel for each sink and source.
For example, an agent flows events from an Avro source called avroWeb
to HDFS sink hdfs-cluster1 via a file channel called file-channel. The
configuration file will contain names of these components and file-
channel as a shared channel for both avroWeb source and hdfs-cluster1
sink.
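A hedged sketch of what such a configuration file could contain, using the component names from the example above (the agent name agent1, the bind address and port, and the HDFS path are assumptions added for illustration):

    # component names
    agent1.sources = avroWeb
    agent1.channels = file-channel
    agent1.sinks = hdfs-cluster1

    # avroWeb source
    agent1.sources.avroWeb.type = avro
    agent1.sources.avroWeb.bind = 0.0.0.0
    agent1.sources.avroWeb.port = 10000
    agent1.sources.avroWeb.channels = file-channel

    # shared file channel
    agent1.channels.file-channel.type = file

    # hdfs-cluster1 sink
    agent1.sinks.hdfs-cluster1.type = hdfs
    agent1.sinks.hdfs-cluster1.hdfs.path = hdfs://cluster1/flume/events
    agent1.sinks.hdfs-cluster1.channel = file-channel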
Run a Flume Agent:
To run a Flume agent that fetches data from the source specified in the
configuration file, we need to follow these steps:
Steps:
1. Write a config file that defines the basic parameters of the agent: its
source (and source type), channel, and sink.
2. Specify the port on the local host that we want to listen on or fetch data
from.
3. Start an agent using the "flume-ng agent ……." command.
4. Thus we can fetch data using Flume.
5. The basic parameters for a config file are as follows (a rough template of
such a file appears after this list):
- agent name
- source name
- channel name
- sink name
- source type
- source miscellaneous properties
- sink type
Explanation:
The following example explains how to write the configuration file and run a
Flume agent to pull data and put it into the centralised store or directory
where we want it, so that further operations can be performed on that data.
This configuration defines a single agent named a1. a1 has a source that
listens for data on port 44440, a channel that buffers event data in
memory, and a sink that logs event data to the console. The configuration
file names the various components, then describes their types and
configuration parameters. A given configuration file might define several
named agents; when a given Flume process is launched a flag is passed
telling it which named agent to manifest.
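A minimal sketch of that a1 configuration, assuming a netcat-style source (the source type and the component names r1, c1, and k1 are assumptions; the agent name a1, port 44440, memory channel, and logger sink come from the description above):

    # a1: component names
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # source: listens for data on port 44440 (netcat type assumed)
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44440
    a1.sources.r1.channels = c1

    # channel: buffers event data in memory
    a1.channels.c1.type = memory

    # sink: logs event data to the console
    a1.sinks.k1.type = logger
    a1.sinks.k1.channel = c1

The agent could then be launched with something like the following (the config file name and conf directory are assumptions):

    flume-ng agent --conf ./conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

Here --name a1 is the flag that tells the Flume process which named agent in the file to manifest.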
Other Features supported by Flume:

1. Failure Handling
In Flume, for each event, two transactions take place: one at the sender
and one at the receiver. The sender sends events to the receiver. Soon
after receiving the data, the receiver commits its own transaction and
sends a “received” signal to the sender. After receiving the signal, the
sender commits its transaction. (Sender will not commit its transaction
till it receives a signal from the receiver.)
Generally events and log data are generated by the log servers and
these servers have Flume agents running on them. These agents receive
the data from the data generators.
The data in these agents will be collected by an intermediate node known
as Collector. Just like agents, there can be multiple collectors in Flume.
Finally, the data from all these collectors will be aggregated and pushed
to a centralized store such as HBase or HDFS.

2. Reliability
In Apache Flume, sources transfer events through the channel: the Flume source
puts events into the channel, and they are then consumed by the sink. The sink
transfers each event to the next agent or to the terminal repository (such as
HDFS).
The events in a Flume channel are removed only when they have been stored in
the next agent's channel or in the terminal repository.
In this way, the single-hop message delivery semantics in Apache Flume provide
end-to-end reliability of the flow. Flume uses a transactional approach to
guarantee reliable delivery of Flume events.

3. Recoverability
The Flume events are staged in a channel on each Flume agent, and the channel
manages recovery from failure. Apache Flume also supports a durable file
channel, which is backed by the local file system.
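For example, a durable file channel could be configured roughly like this (the directory paths and the channel name are assumptions; checkpointDir and dataDirs are the standard file-channel properties):

    # events survive an agent restart because they are persisted to disk
    a1.channels = fileCh
    a1.channels.fileCh.type = file
    a1.channels.fileCh.checkpointDir = /var/flume/checkpoint
    a1.channels.fileCh.dataDirs = /var/flume/data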
Comparison: Flume vs. Sqoop vs. HDFS

Purpose
- Flume: designed for moving bulk streaming data into HDFS.
- Sqoop: designed for importing data from relational databases into HDFS.
- HDFS: the distributed file system used by Apache Hadoop for storing data.

Architecture
- Flume: agent-based; the code that is written (called an 'agent') takes care
of fetching the data.
- Sqoop: connector-based; a connector knows how to connect to the data source
and how to fetch the data.
- HDFS: distributed; the data is spread across commodity hardware.

Data flow
- Flume: data flows to HDFS via zero or more channels.
- Sqoop: HDFS is the destination for the imported data.
- HDFS: is itself the final destination for data storage.

Data load
- Flume: the data load is driven by events.
- Sqoop: the data load is not event-driven.
- HDFS: simply stores whatever data is provided to it, by any means.

When to use
- Flume: for loading streaming data such as web server log files or tweets
generated on Twitter, because Flume agents are designed for fetching streaming
data.
- Sqoop: for importing data from structured data sources, because Sqoop
connectors know how to interact with structured sources and fetch data from
them.
- HDFS: has built-in shell commands for storing data into it; it cannot import
streaming data.
