COLLABORATIVE WHITE PAPER SERIES

The Fast-Track to Hands-On Understanding of Big Data Technology
Stream Social Media to Hadoop & Create Reports in Less Than a Day


Big Data can be intimidating to even the most seasoned IT professional. It is not simply the charged nature of the term "Big" that is ominous; the underlying technology is also app-centric in a very open-source way. If you are like most professionals who do not have a working knowledge of MapReduce, JSON, Hive, or Flume, diving into the deep end of the Big Data technology pool may seem like a time-consuming process. Even if you possess these skill sets, the prospect of launching a Hadoop environment and deploying an application that streams Twitter data into it, in a way that is accessible through standard ODBC tools, would seem like a task measured in weeks, not days.
It may surprise people looking to get hands-on with Big Data technology that they can do so quickly: with the right approach, you can stream live social data to your own Hadoop cluster and report on the information through Excel in less than one day. This whitepaper series provides a "fast track" approach to creating your own Big Data lab environment powered by Apache Hadoop. This first part in the series engages IT professionals with a passing interest in Big Data by providing them with:
• Reasons to explore the world of Big Data and the Big Data skills gap
• A practical, lightweight approach to getting hands-on with Big Data technology
• A description of the use case and its supporting technical components
• Step-by-step instructions for setting up the lab environment, with pointers to Cloudera's streaming Twitter agent tutorial
• Enhancements to Cloudera's tutorial, specifically:
-- Making the tutorial real-time.
-- Steps to establish ODBC connectivity and to execute Cloudera's sample queries in Excel.
-- Configuring and registering libraries at the overall environment level.
-- Sample code and troubleshooting tips.

A reason to explore the universe of Big Data

Before beginning this exercise, the first question an IT professional may ask is why one would care to explore the universe of Big Data in the first place. The fact is that the universe of data is expanding at an accelerating rate, and increasingly the growth is driven by sources of unstructured or machine-generated Big Data (e.g. logs, social media, the "Internet of Things"). The latest IDC Digital Universe Study reveals an explosion of stored information: more than 2.8 zettabytes (roughly 3 billion terabytes) of information was created and replicated in 2012 alone. To put this number in perspective, this means that 95.07 terabytes of information was produced per second over the course of a year. Organizations are increasingly aware that this unrefined data represents an opportunity to gain valuable insight into areas such as ongoing clinical research and monitoring financial risk, and that in some cases it can be processed only by Big Data technologies.

A recent Harvard Business Review survey suggests that 85% of organizations had funded Big Data initiatives in process or in the planning stage, but the survey also reveals a severe gap in analytical skills: 70% of respondents describe finding qualified data scientists as "challenging" to "very difficult"1. Thus Big Data introduces an opportunity for the business, but exposes a skills and technology gap for IT. From a practical perspective, this means that business people will be asking IT for answers to questions that can be supported by sources of Big Data, and IT's ability to support the business will be questioned. This gap must be filled in short order, otherwise businesses will find themselves at a competitive disadvantage.

The right approach

If you are convinced that an understanding of Big Data is important to your business and IT initiatives, in most cases you need to formulate a practical, low-cost, and ultimately relevant approach to understanding the technology and the conventional use cases that resonate with the business. For those of us with day jobs, there aren't enough hours in the day to invest a lot of time in dissecting the various Big Data technology players, and many of us in IT who are new to the world of Big Data could spend weeks getting up to speed on the various options before taking the first step. After all, IT resources are stretched thin, and building every relevant open source component from scratch won't necessarily prove anything from a technology or business point of view. Fortunately, the following game plan provides a universal use case as a starting point, along with practical ways to get a lab environment running, so that you can mobilize business sponsors and technical staff around Big Data capabilities in less than eight hours.

For learning purposes, it makes sense to pursue a fairly common scenario that applies across industries. In our case we will stream social media (specifically, tweets from Twitter) to a lab environment, and then we will report on the data through everyone's favorite BI tool, Excel. From a technical training perspective, the approach relies heavily on Apache Hadoop. The use case will be described in more detail later on.

The following lists the reasons why Hadoop is the preferred platform for learning Big Data and for implementing this scenario:

Why Hadoop
• For skeptics who believe "open-source is not free", Hadoop is one case where the OSS community provides unparalleled processing power through its core HDFS and MapReduce projects, which are available for download and general use.
• The Apache projects supporting Hadoop provide all of the capabilities inherent to the real-time streaming social media example: broad capabilities (e.g. real-time streaming) and, most importantly, an active community of developers and users providing sample code and workarounds to the wrinkles inherent to the world of OSS.
• Hadoop adoption is setting the standard for overall Big Data adoption, thereby substantiating investment in Hadoop skills.

Other Big Data options
• NoSQL databases like MongoDB could feasibly handle documents like the data streaming from social media sites (i.e. JSON). After several test-drives with MongoDB, however, we found the technology relies heavily on JavaScript, which makes sense from a document-store perspective but not from a query and analysis perspective.
• Specialty Big Data technologies like Splunk serve a very specific purpose, e.g. machine data. If you need to learn Big Data with a use case that involves machine data, then consider downloading Splunk.
• Proprietary Big Data platforms from Google (BigQuery) and Amazon (DynamoDB) may make sense for IT shops that have already committed to these vendors, but the resources and communities supporting these proprietary platforms are limited in comparison to Apache Hadoop.

There are many ways to deploy Apache Hadoop. Our example relies on Cloudera's distribution of Apache Hadoop (CDH) running in a Linux VM image. The following lists the key considerations and why CDH was used.

Consideration: Building the Hadoop environment from scratch as opposed to using a distribution.
Direction and rationale: We considered building our Hadoop environment from scratch through the Apache Hadoop projects. Given the time commitment, it seemed more useful to take an existing distribution that ensured interoperability and compatibility of the projects. If your learning objectives include understanding what it takes to ensure compatibility of each Hadoop project, or if you need to tweak the source code, then you should include this step in the approach.

Consideration: Using CDH over HortonWorks or MapR.
Direction and rationale: Alternatives from HortonWorks and MapR were considered. Ultimately, Cloudera's software and support resources and its Twitter feed example made CDH the choice. Cloudera also has VM images available for download with a free edition of Cloudera Manager, Hadoop, and the entire set of Apache Hadoop projects required by the scenario.

Consideration: Deploying Hadoop in the cloud.
Direction and rationale: Deploying the environment to the cloud was considered, and in some cases it may be preferred. An instance of Microsoft HDInsight (specifically, Microsoft's HDInsight distribution that uses HortonWorks) was used running in Azure, and it would have been pursued at greater length, but unfortunately the lease on the Azure instance expired and inquiries on how to extend the lease went unanswered.

The CDH stack (listed below) summarizes the core projects included with CDH; the projects relevant to the use case are captioned2.

Figure 1: Cloudera's distribution including Apache Hadoop (CDH)3

It should be noted that this is a learning exercise, not a performance benchmark. If performance tuning is crucial to your learning objectives, then a more robust environment would be required, since streaming live social data will generate millions of transactions; the business use case itself would still be relevant. For those of us simply wanting to learn Hadoop, a single-node cluster running inside of a Linux VM is deemed sufficient. The specifications of the VM image and the matching host are listed in the Appendix.

Finally, once the streaming data is captured, there are ways to make this use case more comprehensive. For instance, you could use Apache Mahout to cluster and classify the data using the various algorithms available, depending on the key words you have specified. Since much of the classification would depend on business input, it seems reasonable to take a first iteration through the use case as presented, and then proceed with next steps in concert with more involvement or direction from the business.
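To make the Mahout idea above concrete, here is a rough, untested sketch of what a first clustering pass could look like from the command line. It assumes Mahout is installed on the VM and that the tweet text has already been extracted from the JSON into SequenceFiles under an HDFS directory of your choosing; the paths and the number of clusters are illustrative assumptions, not part of Cloudera's tutorial.

   # Sketch only: cluster tweet text with Mahout k-means (paths and k are illustrative assumptions).
   # 1. Convert the extracted tweet text (SequenceFiles) into TF-IDF vectors.
   mahout seq2sparse -i /user/cloudera/tweet-text -o /user/cloudera/tweet-vectors -wt tfidf
   # 2. Run k-means over the vectors; -k generates random initial centroids, -x caps iterations, -cl assigns points to clusters.
   mahout kmeans -i /user/cloudera/tweet-vectors/tfidf-vectors -c /user/cloudera/tweet-centroids -o /user/cloudera/tweet-clusters -k 10 -x 20 -cl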

Use case and supporting Hadoop components

Streaming social media data is a fairly common use case for Big Data and applies across industries. Cloudera provides a tutorial that represents an implementation of this use case; the tutorial is documented thoroughly in a series of blog postings, and the source code is available on GitHub4. This paper builds on Cloudera's tutorial and extends it by making the data available in real time and by reporting on the data in Excel. Using the approach described above and by following the instructions, you will have Tweets streaming into your Hadoop sandbox, reportable in Excel, in less than a business day.

The following represents our version of the streaming Twitter tutorial. Major components are numbered and their purpose explained below.

Figure 2: Streaming social media use case and supporting technical components

How to stream social data to Hadoop in less than a day

Now that we have established a rationale, an approach, and a use case for learning Big Data, we can get started. Despite the many moving parts listed in the use case, you can have the streaming social media use case operational in your own lab environment in less than a working day. The following lists the steps of how to make this happen. Where appropriate, explanations have been provided to ensure the significant concepts and mechanics are understood and reinforced, non-obvious instructions that are not provided in the tutorial are called out, and the rationale for the amendments is also provided.

1. First and foremost, you need a CDH lab environment. Building and configuring this environment from the OS up could take time; fortunately, Cloudera provides a VM image, available for download from Cloudera's website, with all of the necessary Hadoop projects pre-installed and pre-configured. To run the lab environment you will also need VMware Player, which you can download and install from the VMware website. Before starting, verify you have sufficient resources to run the VM image on your host machine; please refer to Cloudera's system requirements, and to the appendix for the host and guest machine specifications used in this example.

2. Start the VM.

3. Beyond the VM, all that is required is a Twitter account, so make sure you have one.

4. Once the VM has started and you have a Twitter account, you can begin the Cloudera Twitter tutorial. Unless you have a need to build the JARs from scratch, you should find that the JARs referenced in the tutorial already exist on the VM image provided by Cloudera, and for the most part you can follow the instructions exactly as provided in the GitHub tutorial. (If you build the JARs from the source, you will probably need 2 days to get the tutorial operational.) The following instructions should be followed in addition to those provided by Cloudera:

   a. If you do build the JARs from source, note that the GCC library was missing from the VM image. To include the library (which is required to install other libraries):
      sudo su -
      yum install gcc

   b. When following the steps under "Configuring Flume":
      i. Step 3 - We had to manually create the flume-ng-agent file with the following contents:
         # FLUME_AGENT_NAME=kings-river-flume
         FLUME_AGENT_NAME=TwitterAgent
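         As a point of clarification (our addition, based on Cloudera's tutorial rather than on the text above): on the CDH VM this agent-name file is expected at /etc/default/flume-ng-agent, and you can confirm its contents with a quick check:

            cat /etc/default/flume-ng-agent
            # expected output:
            # FLUME_AGENT_NAME=TwitterAgent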

      ii. Step 4 - If you are not familiar with the details of your Twitter app, this step may cause confusion. You need to register the Flume Twitter agent with Twitter so that Twitter has a record of your agent and can govern the various 3rd parties that stream Twitter data. To register your Twitter app:
         1. Go to https://dev.twitter.com.
         2. Sign in with your Twitter account.
         3. Click "Create a new Application".
         4. Enter the application details on the form.
         5. Your new application will provide you with 4 security tokens that will be specified in the flume.conf file. These properties are highlighted below.

data scientist.sources. data science.6. big data.accessTokenSecret = <access_token_secret_from_twitter> 7.batchSize = 1000 # number of events written to file before it flushed to HDFS TwitterAgent.rollCount = 10000 # Number of events written to file before it rolled 9. d.Twitter.rollSize = 0 # File size to trigger roll (in bytes) TwitterAgent.conf. the correct spelling is listed in red below: TwitterAgent. newsql.keywords = hadoop. The complete listing of the Flume parameters can be on Cloudera’s website. mahout. TwitterAgent. business intelligence.consumerSecret = <consumer_secret_from_twitter> TwitterAgent.consumerKey = <consumer_key_from_twitter> TwitterAgent. Note that the default flume. analytics. At this point you probably realize the importance of flume.hdfs. data warehouse. In flume.conf.Twitter. data warehousing.HDFS.HDFS.conf provided by Cloudera misspelled data scientist. modify the following parameter according to the key words in which you want to filter tweets.sources. cloudcomputing 8.jar in Step 1 to /usr/lib/hadoop.hdfs. businessintelligence. bigdata.hdfs. it contains the following parameters which govern how big the Flume files are before it rolls into a new file.sinks.conf. These parameters are significant because as you change them.Twitter. enter the following parameters in flume. cloudera.accessToken = <access_token_from_twitter> TwitterAgent.sources. Place flume. The Fast-Track to Hands-On Understanding of Big Data Technology 9 .sinks.HDFS. If flume. mapreduce. Using the values for the application properties highlighted above.Twitter.sources.sinks.conf under /etc/flume-ng/conf as instructed in Step 4. please download it from the GitHub project: TwitterAgent.Twitter. the latency of the tweets will also change. Now copy hive-serdes-1.sources.conf does not exist on /etc/flume-ng/conf. When following the steps under “Setting up Hive”: i.0-SNAPSHOT. hbase. nosql. In addition to containing the details of the Twitter app and the key words.

   d. When following the steps under "Setting up Hive":
      i. Copy the hive-serdes-1.0-SNAPSHOT.jar from Step 1 to /usr/lib/hadoop:
         cp hive-serdes-1.0-SNAPSHOT.jar /usr/lib/hadoop
      ii. Create a new Java package using the following steps. It is necessary to create this Java class and JAR it so that you can exclude the temporary Flume files created as Tweets are streamed to HDFS5. No Java programming knowledge is required; simply follow these instructions.
         mkdir com
         mkdir com/twitter
         mkdir com/twitter/util
         export CLASSPATH=/usr/lib/hadoop/hadoop-common-2.0.0-cdh4.1.2.jar:hadoop-common.jar
         vi com/twitter/util/FileFilterExcludeTmpFiles.java
         Copy the Java source code in the appendix into the file and save it. Then compile, package, and deploy the class:
         javac com/twitter/util/FileFilterExcludeTmpFiles.java
         jar cf TwitterUtil.jar com
         cp TwitterUtil.jar /usr/lib/hadoop
      iii. After Step 4, edit the file /etc/hive/conf/hive-site.xml and add the following tags. This becomes part of the overall Hive configuration that is available to each Hive session. The first property ensures that you won't have to add the JSON SerDe package and the new custom package that excludes Flume temporary files for each Hive session. The second property tells MapReduce the class name and location of the new Java class that we created and compiled above.
         <property>
           <name>hive.aux.jars.path</name>
           <value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value>
         </property>
         <property>
           <name>mapred.input.pathFilter.class</name>
           <value>com.twitter.util.FileFilterExcludeTmpFiles</value>
         </property>
      iv. Bounce the hive servers:
         sudo service hive-server stop
         sudo service hive-server2 stop
         sudo service hive-server start
         sudo service hive-server2 start

   e. When following the steps under "Prepare the Oozie workflow":
      i. If you haven't done so already, download the Oozie files from the Cloudera GitHub site.
      ii. Before Step 4, edit the job.properties file accordingly:
         1. Make sure the following parameters reference localhost.localdomain, not just localhost:
            nameNode=hdfs://localhost.localdomain:8020
            jobTracker=localhost.localdomain:8021
         2. The jobStart, jobEnd, tzOffset, and initialDataset parameters require explanation. jobEnd tells the Oozie workflow when to wind down, so it can be set well into the future. The parameter initialDataset tells the workflow the earliest year, month, day, and hour for which you have data and for which it can therefore add a partition to the Hive tweets table. jobStart should be set to the initialDataset plus or minus the tzOffset. Let's say Flume is streaming the tweets to the HDFS folder /user/flume/tweets/*. In the following example, the parameters specify that the first set of Tweets lives on HDFS under /user/flume/tweets/2013/01/17/08, and once that directory is available the workflow will execute the Hive Query Language script "add-partition.q". Note that jobStart (13:00Z) is simply the initialDataset (08:00Z) shifted by the five-hour tzOffset.
            jobStart=2013-01-17T13:00Z
            jobEnd=2013-12-12T23:00Z
            initialDataset=2013-01-17T08:00Z
            tzOffset=-5

      iii. Edit coord-app.xml:
         a. Change the timezone from "America/Los_Angeles" to "America/New_York" (or the corresponding timezone for your location):
            initial-instance="${initialDataset}" timezone="America/New_York">
         b. Remove the following tags, which instruct Oozie not to kick off a coordinator action until the next dataset starts becoming available:
            <data-in name="readyIndicator" dataset="tweets">
              <!-- I've done something here that is a little bit of a hack. Since Flume doesn't have a good mechanism for notifying an application of when it has rolled to a new directory, we can just use the next directory as an input event. -->
              <instance>${coord:current(1 + (coord:tzOffset() / 60))}</instance>
            </data-in>
            This change is extremely important in making the tutorial as real-time as possible. The default readyIndicator instructs the workflow to create a new partition only after an hour completes; thus, if you leave this configuration as-is, there will be a lag of as much as one hour between tweets arriving and the tweets becoming queryable. The reason for this default configuration is that the tutorial did not define the custom JAR we built and deployed for Hive, which instructs MapReduce to omit temporary Flume files. Because we have deployed this custom package, we do not have to force a full hour to complete before querying tweets.

   f. If you haven't done so already, enable the Oozie web console according to the Cloudera documentation. Doing so allows Oozie coordinator jobs and workflows to be accessed from the console located at http://localhost.localdomain:11000/oozie/.

Once you have started the Flume Agent (under "Starting the data pipeline"), you will see Tweets streaming to your HDFS. You can browse the HDFS directory structure from the Hadoop NameNode console on your cluster, which you can access at http://localhost.localdomain:50070/dfshealth.jsp.
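With the agent streaming and coord-app.xml edited, the coordinator can be submitted with the standard Oozie CLI. The sketch below is our suggestion rather than a step from Cloudera's tutorial; it assumes the oozie-workflows directory from the GitHub project and the console URL used above, so adjust both to your environment. It also confirms first that the hour directory implied by initialDataset actually exists (the date in the path comes from the example values):

   # confirm the first hour of data exists where initialDataset points (substitute your own date)
   hadoop fs -ls /user/flume/tweets/2013/01/17/
   # submit the coordinator and list coordinator jobs to confirm it is RUNNING
   oozie job -oozie http://localhost.localdomain:11000/oozie -config oozie-workflows/job.properties -run
   oozie jobs -oozie http://localhost.localdomain:11000/oozie -jobtype coordinator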

If you are experiencing technical issues, please reference the Troubleshooting Guide in the appendix.

5. Set up ODBC connectivity through Excel. ODBC connectivity to Hive from an application is a logical extension of the Cloudera Twitter tutorial.
   a. Configure an ODBC connection to the Hive database:
      i. There are several ODBC drivers for Hive, but many were not compatible with Excel (e.g. Cloudera's ODBC driver for Tableau) or not compatible with Cloudera's environment (Microsoft's ODBC driver for Hive, which only worked when connecting to Microsoft HDInsight). We successfully used MapR's ODBC driver for Windows located here. Since we are running 32-bit Excel, we needed to download the 32-bit ODBC driver for Hive, but MapR has a 64-bit driver as well.
      ii. Download and install the appropriate ODBC driver from MapR's website.
      iii. We recommend specifying an entry in your Windows hosts file (C:\Windows\System32\drivers\etc\hosts) to alias the IP address of your VM. You can get the IP address from your VM by typing the command "ifconfig". For example:
         192.168.198.130 cloudera-vm
      iv. Create an ODBC connection (DSN) to the Hive database.
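      Before pointing Excel at the DSN, it can save time to confirm that the Hive server the driver will talk to is actually up and listening on the VM. A minimal check, assuming the driver connects to HiveServer2 on its default port 10000:

         sudo service hive-server2 status
         netstat -lnt | grep 10000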

   b. Open a new Excel workbook:
      1. From the "Data" tab, select "From Other Sources".
      2. Select "From Data Connection Wizard".
      3. Select "ODBC DSN" and click Next.
      4. Select the DSN you set up using the MapR driver (Cloudera Hive VM MapR).
      5. Select the "tweets" table.
      6. Select "Finish".

      7. Select Properties, then select the "Definition" tab.
      8. Using one of the Hive queries provided in the Appendix, copy the HQL and paste it into the "Command Text" box; also save the password. This is the important part, because we must override the HQL in order for the query to execute. At the time this article was written, the major ODBC drivers append "default" to the Hive query, and the MapR ODBC driver was the only one able to establish connectivity while allowing us to override the HQL.
      9. Hit OK to import the data.
      10. Repeat for the remaining queries in the appendix, and create as many queries as you see fit. HQL is very SQL-like, and for those of us who know SQL it will be easy to adapt the queries from the appendix into other statements that provide the views you need.
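      If a query fails in Excel, it is worth validating the HQL directly in the Hive CLI on the VM first, since errors surface more clearly there. For example, using the first query from the appendix:

         hive -e "SELECT user.time_zone, SUBSTR(created_at, 0, 3), COUNT(*) AS total_count FROM tweets WHERE user.time_zone IS NOT NULL GROUP BY user.time_zone, SUBSTR(created_at, 0, 3) ORDER BY total_count DESC LIMIT 15"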

Summary

Once you have successfully completed this tutorial and demystified Big Data technology, you should have a clearer understanding of Hadoop, specifically:
1. A quick overview of core Hadoop projects and how each is used to support streaming social media and reporting through a standard ODBC connection.
2. How to model semi-structured JSON data in Hive and query it in a conventional manner.
3. A real-world reference model for a use case illustrating the amazing streaming capabilities in Hadoop.
4. An operational Hadoop sandbox that you can navigate and explore, and that can be used for training, local development, and proofs of concept.

Above all, this exercise should leave individuals wanting to take the Hadoop experience to the next level. Independently, you can take your understanding, and your ability to support additional business use cases, to these next levels: you could consider streaming data from other social media sites (if so, we recommend starting here); you can layer in a Mahout program to cluster and classify the tweets, thereby simulating some form of sentiment analysis; and you may also want to layer geospatial data into the set to provide more advanced analytics. Lastly, you may want to show someone from the business what this new technology can do.

Appendix: Custom Java code for MapReduce PathFilter

package com.twitter.util;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class FileFilterExcludeTmpFiles implements PathFilter {
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".") && !name.endsWith(".tmp");
    }
}

Appendix: Hardware/software environment

Host:
  OS: Windows 7 Enterprise 64-bit
  Processor: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz
  Memory: 8GB (7.9 Addressable)
  Disk: 300GB
  Software: VMware Player 3.1.2 build-301548, Microsoft Office 32-bit

Guest:
  OS: CentOS 6.2 Linux 64-bit
  Processor: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.27GHz
  Memory: 2.98 GB
  Disk: 23.5GB
  Software: CDH 4.1, Cloudera Manager Free Edition 4.1

Appendix: Troubleshooting guide

Error message / stack trace:
  FAILED: RuntimeException MetaException(message:org.apache.hadoop.hive.serde2.SerDeException SerDe com.cloudera.hive.serde.JSONSerDe does not exist
Cause: Hive cannot find hive-serdes-1.0-SNAPSHOT.jar.
Resolution:
  1. Place hive-serdes-1.0-SNAPSHOT.jar in /usr/lib/hadoop.
  2. Edit /etc/hive/conf/hive-site.xml and add the following:
     <property>
       <name>hive.aux.jars.path</name>
       <value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value>
     </property>
  3. Stop and restart the hive services.

Error message / stack trace:
  Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [10001]
Cause: Missing MySQL driver.
Resolution: cp /var/lib/oozie/mysql-connector-java.jar oozie-workflows/lib

Error message / stack trace:
  OLE DB or ODBC error: [MapR][Hardy] (22) Error from ThriftHiveClient: Query returned non-zero code: 2, cause: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask. HY000. An error occurred while the partition, with the ID of 'Tweets By Timezone_cbf7182e-a7a6-416c-a3fd-d7f484952cc6', Name of 'Tweets By Timezone', was being processed. The current operation was cancelled because another operation in the transaction failed.
Cause: Flume temp file permissions issue.
Resolution: Walk through the instructions under "Setting up Hive" to ensure the custom Java class that sets the MapReduce pathFilter is built, deployed, and referenced in Hive as specified.

Error message / stack trace:
  2013-01-17 13:57:37.027 INFO org.apache.oozie.command.coord.CoordActionInputCheckXCommand: USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000068-130117082739514-oozie-oozi-C] ACTION[0000068-130117082739514-oozie-oozi-C@2] [0000068-130117082739514-oozie-oozi-C@2]::ActionInputCheck:: In checkListOfPaths: hdfs://localhost.localdomain:8020/user/flume/tweets/2013/01/17/10 is Missing.
Cause: Permissions on /user/flume/*.
Resolution: Change permissions on /user/flume: sudo -u flume hadoop fs -chmod -R 777 /user/flume

Error message / stack trace:
  variable[wfInput] cannot be resolved
Cause: Oozie attempts to add a partition that does not exist.
Resolution: Ensure the files have been streamed to the proper HDFS location (e.g. /user/flume/), modify initialDataset in job.properties to the proper starting point for the Oozie workflow, or ignore the error (in some cases you may have paused the stream and do not need the files).

Error message / stack trace:
  java.io.IOException: java.lang.reflect.InvocationTargetException
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java)
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java)
  at org.apache.hadoop.mapred.Child$4.run(Child.java)
  at java.security.AccessController.doPrivileged(Native Method)
Cause: Flume temp file permissions issue.
Resolution: Walk through the instructions under "Setting up Hive" to ensure the custom Java class that sets the MapReduce pathFilter is built, deployed, and referenced in Hive as specified.

Error message / stack trace:
  Error 08001 Unable to establish connection with the Hive server
  Error 01004 Out Connection String buffer size not allocated
  Error 01000 Batch size not set or is invalid. Defaulting to 65536
Cause: ODBC driver issue; the ODBC driver is prefixing "default" before the Hive table name.
Resolution: Follow the steps in setting up ODBC connectivity in Excel.

Appendix: Excel queries6

Tweets by time zone and day

SELECT user.time_zone, SUBSTR(created_at, 0, 3), COUNT(*) AS total_count
FROM tweets
WHERE user.time_zone IS NOT NULL
GROUP BY user.time_zone, SUBSTR(created_at, 0, 3)
ORDER BY total_count DESC
LIMIT 15

Top 15 Big Data hashtags

SELECT LOWER(hashtags.text), COUNT(*) AS total_count
FROM tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15

Top 10 retweeted users on Big Data topics

SELECT t.retweeted_screen_name, SUM(retweets) AS total_retweets, COUNT(*) AS tweet_count
FROM (SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
             retweeted_status.text,
             MAX(retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10

Top 200 most active users on Big Data topics

SELECT user.screen_name, COUNT(*) AS tweet_cnt
FROM tweets
GROUP BY user.screen_name
ORDER BY tweet_cnt DESC
LIMIT 200
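As an illustration of adapting these queries, the variant below counts the most-mentioned users instead of hashtags. It is a sketch that assumes the tweets table was created with the entities.user_mentions structure from Cloudera's tutorial; verify the column layout with DESCRIBE tweets before relying on it.

Top 15 mentioned users (illustrative variant)

SELECT LOWER(mention.screen_name) AS mentioned_user, COUNT(*) AS total_count
FROM tweets
LATERAL VIEW EXPLODE(entities.user_mentions) t1 AS mention
GROUP BY LOWER(mention.screen_name)
ORDER BY total_count DESC
LIMIT 15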

References

1. http://blogs.hbr.org/cs/2012/11/the_big_data_talent_gap_no_pan.html
2. Definitions from the Apache Hadoop website for each respective package.
3. http://www.cloudera.com/content/cloudera/en/products/cdh.html
4. http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
   http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
   http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
   https://github.com/cloudera/cdh-twitter-example
5. Known issue with Flume; see https://issues.apache.org/jira/browse/FLUME-1702
6. Adapted from http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

About Collaborative

Collaborative Consulting is dedicated to helping companies optimize their existing business and technology assets. Founded in 1999, with headquarters in Burlington, Massachusetts, Collaborative Consulting serves clients from offices across the United States. The company is committed to building long-term relationships and strives to be a trusted partner with every client.

© 2013 Collaborative Consulting
877-376-9900
www.collaborative.com