
Apache Flume

By:
• Madhusree Prakash
• Titikshya Behra
• Yamini Patel
• Meetali Saxena
What is Apache Flume?

Hadoop’s history and advantages

Introduction to Flume

Why Apache Flume?

Flume Features

Flume Architecture
What is Apache Flume?

• Apache Flume is an open-source data collection service for moving data from a source to a destination.
• It is a tool used to collect, aggregate, and transport large amounts of streaming data, such as log files and events, from a number of different sources to a centralized data store (for example, the Hadoop Distributed File System, HDFS).
• It is a highly distributed, reliable, and configurable tool. Flume was mainly designed to collect streaming data (log data) from various web servers into the Hadoop Distributed File System (HDFS).
• Hadoop is designed to answer the question: “How can we process big data at reasonable cost and in reasonable time?”
• Some applications of Hadoop:
  Advertisement (mining user behaviour to generate recommendations)
  Search (grouping related documents)
  Security (searching for uncommon patterns)
Challenges with big data

[Timeline graphic: search engines of the 1990s, the Google search engine (1998), and the Hadoop framework and its tools (2013)]
Hadoop’s Developers

• 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. The project was funded by Yahoo.
• 2006: Yahoo gave the project to Apache.
Why Apache Flume?

• Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
• Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.
Applications of Flume

• Flume is used to move the log data generated by application servers into HDFS at high speed.
• Assume an e-commerce web application wants to analyse customer behaviour from a particular region. To do so, it would need to move the available log data into Hadoop for analysis. Here, Apache Flume comes to our rescue.
Advantages of Flume

• Using Apache Flume, we can store the data into any of the centralized stores (HBase, HDFS).
• When the rate of incoming data exceeds the rate at which data can be written to the destination, Flume acts as a mediator between the data producers and the centralized stores and provides a steady flow of data between them.
• Flume provides the feature of contextual routing.
• The transactions in Flume are channel-based, where two transactions (one sender and one receiver) are maintained for each message. This guarantees reliable message delivery (see the channel sketch after this list).
• Flume is reliable, fault tolerant, scalable, manageable, and customizable.
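
As an illustrative sketch of the channel-based reliability described above, a durable file channel for a hypothetical agent named agent1 might be configured as follows (the agent name, channel name, and paths are assumptions, not part of the original slides):

    # Durable channel: events are checkpointed to local disk and survive agent restarts
    agent1.channels = ch1
    agent1.channels.ch1.type = file
    # Illustrative local paths; adjust for the actual host
    agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
    agent1.channels.ch1.dataDirs = /var/flume/data
    agent1.channels.ch1.capacity = 10000
    agent1.channels.ch1.transactionCapacity = 1000

Each put (from a source) and take (by a sink) happens inside a channel transaction, which is what gives Flume its delivery guarantee.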
Features of Flume

• Flume ingests log data from multiple web servers into a centralized store (HDFS, HBase) efficiently.
• Using Flume, we can get the data from multiple servers immediately into Hadoop.
• Along with log files, Flume is also used to import huge volumes of event data produced by social networking sites like Facebook and Twitter, and e-commerce websites like Amazon and Flipkart.
• Flume supports a large set of source and destination types.
• Flume supports multi-hop flows, fan-in and fan-out flows, contextual routing, etc. (a fan-out sketch follows this list).
• Flume can be scaled horizontally.
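
To make the fan-out idea concrete, here is a minimal sketch for a hypothetical agent named agent1 in which one source replicates every event to two channels, one drained by an HDFS sink and one by a logger sink; all names, hosts, and paths below are illustrative assumptions:

    agent1.sources = src1
    agent1.channels = ch1 ch2
    agent1.sinks = hdfsSink logSink

    # One source feeding two channels; the replicating selector copies each event to both
    agent1.sources.src1.type = netcat
    agent1.sources.src1.bind = localhost
    agent1.sources.src1.port = 44444
    agent1.sources.src1.channels = ch1 ch2
    agent1.sources.src1.selector.type = replicating

    agent1.channels.ch1.type = memory
    agent1.channels.ch2.type = memory

    # One copy of each event lands in HDFS, the other is logged for inspection
    agent1.sinks.hdfsSink.type = hdfs
    agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/events
    agent1.sinks.hdfsSink.channel = ch1
    agent1.sinks.logSink.type = logger
    agent1.sinks.logSink.channel = ch2

A fan-in flow is the mirror image: many agents point their Avro sinks at the Avro source of a single collector agent.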
FLUME – ARCHITECTURE

Basic architecture of Flume:

Event: A single data record entry transported by Flume is known as an Event.

Client: Events are generated by clients and sent to one or more agents.

Agent: An Agent is a JVM process in which Flume runs. It consists of sources, channels, sinks, and other important components through which events can be transferred from one place to another.

Source: A Source is an active component that listens for events and writes them to one or more channels. It is the main component through which data enters Flume. It collects data from a variety of sources, such as exec, Avro, JMS, spooling directory, etc.

Channel: A Channel is a conduit between the Source and the Sink that queues event data as transactions. Events are ingested by the sources into the channel and drained by the sinks from the channel.

Sink: A Sink is the component that removes events from a channel and delivers them to the destination or the next hop. There are multiple sinks available that deliver data to a wide range of destinations. Examples: HDFS, HBase, logger, null, file, etc.

(A complete single-agent configuration wiring these components together is sketched below.)
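
As a minimal sketch of how these components fit together, the configuration below defines a single agent with an exec source tailing a log file, a memory channel, and an HDFS sink. The agent name, file path, and NameNode address are illustrative assumptions:

    # agent1: exec source -> memory channel -> HDFS sink
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # Source: tail an application log file (illustrative path)
    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/app/app.log
    agent1.sources.src1.channels = ch1

    # Channel: in-memory buffer between source and sink
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 1000
    agent1.channels.ch1.transactionCapacity = 100

    # Sink: write events into HDFS as plain text
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/applogs
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.channel = ch1

Assuming the file is saved as flume-agent1.conf, the agent would typically be started with a command along these lines:

    flume-ng agent --conf conf --conf-file flume-agent1.conf --name agent1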
Thank you
