
CS157B - Big Data Management

Flume with
Twitter
Integration
by Swathi Kotturu

Date: 03/03/2014
Professor: Thanh Tran

ETL Using Flume


What is Flume?
Apache Flume is a distributed service for efficiently
collecting, aggregating, and moving large amounts of
log data.
Flume integrates with Hadoop and can be used to capture
streaming Twitter data, which can be filtered by keywords
and locations.

More About Flume


It has a very simple architecture based on streaming data flows.
Flume reads events from a source, buffers them in a channel
(such as a memory channel), and delivers them through a sink into HDFS.

Flume Agents
Flume can deploy any number of agents. An agent is a
container for a Flume data flow. It can run any number of
sources, sinks, and channels, but must have at least one of each.
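As a sketch, an agent's topology is declared in a properties file; the names used here (agent1, src1, ch1, sink1) are illustrative, not from this deck:

```
# One agent with one source, one channel, one sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Wire the source and sink to the channel
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1
```

Each component then gets its own type and settings under the same prefix, as the Twitter configuration later in these slides shows.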

Flume Sources
Sources are not necessarily restricted to log data.
It is possible to use Flume to transport event data such as
network traffic data, social-media-generated data,
e-mail messages, etc.
The events can be HTTP POSTs, RPC calls, strings in
stdout, etc.
After an event occurs, Flume sources write the event to a
channel as a transaction.

Flume Channels
Channels are internal passive stores with specific
characteristics. This allows a source and a sink to run
asynchronously.
Two Main Types of Channels
Memory Channels
- Volatile channel that buffers events in memory
only. If the JVM crashes, all data is lost.
File Channels
- Persistent channel that stores events on disk.
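A minimal sketch of how the two channel types are configured side by side; the agent and channel names are illustrative, and the capacities and directories are example values:

```
# Memory channel: fast, but events are lost if the JVM dies
agent1.channels.mem1.type = memory
agent1.channels.mem1.capacity = 10000

# File channel: slower, but events survive a crash
agent1.channels.file1.type = file
agent1.channels.file1.checkpointDir = /var/flume/checkpoint
agent1.channels.file1.dataDirs = /var/flume/data
```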

You can run multiple agents and servers to collect data in
parallel.

Get Twitter Access
Create a Twitter application at dev.twitter.com to obtain a
consumer key/secret and an access token/secret; the Flume
configuration needs all four.

Flume in Cloudera
Download flume-sources-1.0-SNAPSHOT.jar and add it to the
Flume classpath: http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar
In the Cloudera Manager, you can add the classpath under:
Services -> flume1 -> Configuration -> Agent(Default) ->
Advanced -> Java Configuration Options for Flume Agent, add:
classpath /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/flume-ng/lib/flume-sources-1.0-SNAPSHOT.jar

Flume in Cloudera (cont.)



You also have to exclude the original file that came
pre-installed with Flume by renaming it with a .org
extension. The file is search-contrib-1.0.0-jar-with-dependencies.jar
and is in the /usr/lib/flume-ng/lib/ path.
mv search-contrib-1.0.0-jar-with-dependencies.jar search-contrib-1.0.0-jar-with-dependencies.jar.org
Using Hue, create a user named flume and give it
read and write access in HDFS.
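Alternatively, the flume user's HDFS access can be set up from the shell; a sketch, assuming the /user/flume/tweets landing path used in the agent configuration and a cluster where the hdfs superuser runs these commands:

```
# Create the landing directory and hand it to the flume user
sudo -u hdfs hadoop fs -mkdir /user/flume
sudo -u hdfs hadoop fs -mkdir /user/flume/tweets
sudo -u hdfs hadoop fs -chown -R flume /user/flume
```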

Flume in Cloudera (cont.)



From the Cloudera Manager, go to
Services -> flume1 -> Configuration ->
Agent(Default) -> Agent Name.
Set the Agent Name to TwitterAgent (it must match the
prefix used in the configuration file).

Flume in Cloudera (cont.)



Also set the Configuration File to the following, and make sure to replace
the consumerKey, consumerSecret, accessToken, and accessTokenSecret
values with your own.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>

Flume in Cloudera (cont.)


TwitterAgent.sources.Twitter.keywords = flu, runny nose, tissue, sick, ill, cough
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

Flume in Cloudera (cont.)



Restart Flume Agent
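Outside Cloudera Manager, a stand-alone agent can also be started from the command line; a sketch, assuming the configuration above is saved as twitter.conf:

```
flume-ng agent --conf /etc/flume-ng/conf \
  --conf-file twitter.conf \
  --name TwitterAgent \
  -Dflume.root.logger=INFO,console
```

The --name value must match the TwitterAgent prefix in the configuration file, just as the Agent Name setting in Cloudera Manager must.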

Flume in Cloudera (cont.)


Example Tweet
We loaded raw tweets into HDFS, where each tweet is
represented as a chunk of JSON.
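To make the "chunks of JSON" concrete, here is a small sketch that parses one such record. The sample tweet is hypothetical, but the fields shown (created_at, text, user.screen_name) are standard Twitter Streaming API fields:

```python
import json

# A hypothetical raw tweet, shaped like the JSON records
# written to HDFS (one JSON object per event).
raw = ('{"created_at": "Mon Mar 03 12:00:00 +0000 2014", '
       '"text": "Home sick with a cough", '
       '"user": {"screen_name": "example_user"}}')

tweet = json.loads(raw)
print(tweet["user"]["screen_name"])  # -> example_user
print(tweet["text"])                 # -> Home sick with a cough
```

Downstream tools (Hive, in the next step) need to understand this nested JSON structure rather than a flat delimited row.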

Next Steps
Tell Hive how to read the data

You will need hive-serdes-1.0-SNAPSHOT.jar:
http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar

Hive is set up to read delimited row formats by default,
but in this case it needs to read JSON, which is what
the SerDe provides.
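A sketch of the Hive DDL that wires the SerDe to the tweet data, following the Cloudera blog approach this deck is based on; the column list here is a small illustrative subset of the full tweet schema:

```sql
ADD JAR hive-serdes-1.0-SNAPSHOT.jar;

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  user STRUCT<screen_name:STRING, friends_count:INT>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
```

The EXTERNAL table points at the directory the Flume HDFS sink writes to, so newly collected tweets become queryable without reloading.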

Flume Resources
Learn More
https://dev.twitter.com/docs/streaming-apis/parameters
https://cwiki.apache.org/confluence/display/FLUME/Home
http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

Thank you!
Q/A
