
CS157B - Big Data Management

Flume with
Twitter
Integration
by Swathi Kotturu

Date: 03/03/2014
Professor: Thanh Tran

ETL Using Flume


What is Flume?
Apache Flume is a distributed service for efficiently
collecting, aggregating, and moving large amounts of
log data.
Flume integrates with Hadoop and can be used to capture
streaming Twitter data, which can be filtered by keywords
and locations.

More About Flume


It has a very simple architecture based on streaming data flows.
Flume reads events from a source, buffers them in a channel
(such as a memory channel), and delivers them through a sink into HDFS.

Flume Agents
Flume can deploy any number of agents. An agent is a
container for a Flume data flow. It can run any number of
sources, sinks, and channels, but must have at least one of each.
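As a sketch, an agent's topology is declared in a properties file; the names used here (agent1, src1, ch1, sink1) are illustrative, not from this deck:

```
# One agent with one source, one channel, one sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Wire the source and sink to the channel
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1
```

Each component then gets its own type and settings under the same prefix, as the Twitter configuration later in these slides shows.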

Flume Sources
Sources are not necessarily restricted to log data.
It is possible to use Flume to transport event data such as
network traffic data, social-media-generated data,
e-mail messages, etc.
The events can be HTTP POSTs, RPC calls, strings in
stdout, etc.
After an event occurs, Flume sources write the event to a
channel as a transaction.

Flume Channels
Channels are internal passive stores with specific
characteristics. This allows a source and a sink to run
asynchronously.
Two Main Types of Channels
Memory Channels
- Volatile channel that buffers events in memory
only. If the JVM crashes, all data is lost.
File Channels
- Persistent channel that stores events on disk.
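A minimal sketch of how the two channel types are configured side by side; the agent and channel names are illustrative, and the capacities and directories are example values:

```
# Memory channel: fast, but events are lost if the JVM dies
agent1.channels.mem1.type = memory
agent1.channels.mem1.capacity = 10000

# File channel: slower, but events survive a crash
agent1.channels.file1.type = file
agent1.channels.file1.checkpointDir = /var/flume/checkpoint
agent1.channels.file1.dataDirs = /var/flume/data
```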

You can run multiple agents and servers to collect data in
parallel.

Get Twitter Access
Create a Twitter application at dev.twitter.com to obtain a
consumer key/secret and an access token/secret; the Flume
configuration needs all four.

Flume in Cloudera
Download flume-sources-1.0-SNAPSHOT.jar and add it to the
Flume classpath: http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar
In the Cloudera Manager, you can add the classpath under:
Services -> flume1 -> Configuration -> Agent(Default) ->
Advanced -> Java Configuration Options for Flume Agent, add:
classpath /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/flume-ng/lib/flume-sources-1.0-SNAPSHOT.jar

Flume in Cloudera (cont.)



You also have to exclude the original file that came
pre-installed with Flume by renaming it with a .org
extension. The file is search-contrib-1.0.0-jar-with-dependencies.jar
and is in the /usr/lib/flume-ng/lib/ path.
mv search-contrib-1.0.0-jar-with-dependencies.jar search-contrib-1.0.0-jar-with-dependencies.jar.org
Using Hue, create a user named flume and give it
read and write access in HDFS.
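Alternatively, the flume user's HDFS access can be set up from the shell; a sketch, assuming the /user/flume/tweets landing path used in the agent configuration and a cluster where the hdfs superuser runs these commands:

```
# Create the landing directory and hand it to the flume user
sudo -u hdfs hadoop fs -mkdir /user/flume
sudo -u hdfs hadoop fs -mkdir /user/flume/tweets
sudo -u hdfs hadoop fs -chown -R flume /user/flume
```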

Flume in Cloudera (cont.)



From the Cloudera Manager, go to
Services -> flume1 -> Configuration ->
Agent(Default) -> Agent Name.
Set the Agent Name to TwitterAgent (it must match the
prefix used in the configuration file).

Flume in Cloudera (cont.)



Also set the Configuration File to the following, and make sure to replace
the consumerKey, consumerSecret, accessToken, and accessTokenSecret
values with your own.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>

Flume in Cloudera (cont.)


TwitterAgent.sources.Twitter.keywords = flu, runny nose, tissue, sick, ill, cough
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

Flume in Cloudera (cont.)



Restart Flume Agent
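Outside Cloudera Manager, a stand-alone agent can also be started from the command line; a sketch, assuming the configuration above is saved as twitter.conf:

```
flume-ng agent --conf /etc/flume-ng/conf \
  --conf-file twitter.conf \
  --name TwitterAgent \
  -Dflume.root.logger=INFO,console
```

The --name value must match the TwitterAgent prefix in the configuration file, just as the Agent Name setting in Cloudera Manager must.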

Flume in Cloudera (cont.)


Example Tweet
We loaded raw tweets into HDFS, where each tweet is
represented as a chunk of JSON.
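To make the "chunks of JSON" concrete, here is a small sketch that parses one such record. The sample tweet is hypothetical, but the fields shown (created_at, text, user.screen_name) are standard Twitter Streaming API fields:

```python
import json

# A hypothetical raw tweet, shaped like the JSON records
# written to HDFS (one JSON object per event).
raw = ('{"created_at": "Mon Mar 03 12:00:00 +0000 2014", '
       '"text": "Home sick with a cough", '
       '"user": {"screen_name": "example_user"}}')

tweet = json.loads(raw)
print(tweet["user"]["screen_name"])  # -> example_user
print(tweet["text"])                 # -> Home sick with a cough
```

Downstream tools (Hive, in the next step) need to understand this nested JSON structure rather than a flat delimited row.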

Next Steps
Tell Hive how to read the data

You will need hive-serdes-1.0-SNAPSHOT.jar:
http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar

Hive is set up to read delimited row formats by default,
but in this case it needs to read JSON, which is what
the SerDe provides.
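A sketch of the Hive DDL that wires the SerDe to the tweet data, following the Cloudera blog approach this deck is based on; the column list here is a small illustrative subset of the full tweet schema:

```sql
ADD JAR hive-serdes-1.0-SNAPSHOT.jar;

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  user STRUCT<screen_name:STRING, friends_count:INT>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
```

The EXTERNAL table points at the directory the Flume HDFS sink writes to, so newly collected tweets become queryable without reloading.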

Flume Resources
Learn More
https://dev.twitter.com/docs/streaming-apis/parameters
https://cwiki.apache.org/confluence/display/FLUME/Home
http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

Thank you!
Q/A
