
ITD351 Big Data Management Essentials

Lab 5 Correlate structured data with unstructured data

Objective:
In this lab, you will learn to:

1. Upload bulk data to Hadoop
2. Create an intermediate table in Hive
3. Parse logs into individual fields of the intermediate table using a regular expression
4. Transfer the data into a separate table
5. Query this table via Apache Impala and Hue

Scenario:

Produce a report on structured data to answer interesting business questions such as: are the
most viewed products also the most sold? Hadoop can store unstructured and semi-structured
data alongside structured data without remodeling an entire database. In this lab, you will
learn to ingest, store, and process web log events to find out what site visitors have actually
viewed the most.

For this, you need web clickstream data. The most common way to ingest a web clickstream
is to use Apache Flume or Apache Kafka. We have prepared a web clickstream data set for
you so that you can bulk upload it into HDFS directly.

What is Apache Flume?


Flume is a scalable, real-time ingest framework that allows you to route, filter, aggregate, and
perform "mini-operations" on data on its way into the scalable processing platform. At the end
of this lab, we shall explore a Flume configuration example.
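As a brief preview, the sketch below shows what a minimal Flume agent configuration might look like for tailing a web access log into HDFS. It is illustrative only: the agent name (webagent), the source command, and the sizing values are assumptions, not the configuration used in this lab.

# Hypothetical Flume agent "webagent" (names and paths are assumptions for illustration)
webagent.sources  = weblog
webagent.channels = memchannel
webagent.sinks    = hdfssink

# Source: follow the access log as new lines are appended
webagent.sources.weblog.type = exec
webagent.sources.weblog.command = tail -F /opt/examples/log_files/access.log
webagent.sources.weblog.channels = memchannel

# Channel: buffer events in memory between source and sink
webagent.channels.memchannel.type = memory
webagent.channels.memchannel.capacity = 10000

# Sink: write the events into an HDFS directory as plain text
webagent.sinks.hdfssink.type = hdfs
webagent.sinks.hdfssink.hdfs.path = /user/hive/warehouse/original_access_logs
webagent.sinks.hdfssink.hdfs.fileType = DataStream
webagent.sinks.hdfssink.channel = memchannel

Such an agent would typically be started with a command along the lines of
$flume-ng agent --name webagent --conf-file webagent.conf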

What is Apache Kafka?


Kafka is an open-source technology that can be used to capture data in real time from
event sources such as databases, sensors, mobile devices, cloud services, and software
applications in the form of streams of events. These event streams can be stored durably for
later retrieval, and can be manipulated, processed, and reacted to in real time as well
as retrospectively. Kafka can also be used to route the event streams to different destination
technologies or front-end BI tools as needed. Event stream processing thus ensures a
continuous flow and interpretation of data, so that the right information is in the right place at
the right time.
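To make this concrete, the commands below sketch how events could be written to and read back from a Kafka topic using the console tools shipped with Kafka. The broker address (localhost:9092) and the topic name (clickstream) are assumptions for illustration and are not part of this lab's environment; the exact script names and flags vary slightly between Kafka versions and distributions.

# Write events to a hypothetical "clickstream" topic (type one event per line)
$kafka-console-producer --broker-list localhost:9092 --topic clickstream

# Read the same events back, from the beginning of the topic
$kafka-console-consumer --bootstrap-server localhost:9092 --topic clickstream --from-beginning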



Tasks:

A. Bulk Upload Data

For convenience, we have pre-loaded some sample access log data into
/opt/examples/log_files/access.log.2. This file can also be found in the PolyMall Lab Sheet
menu. A data sample from the file access.log.2 is shown in the Annex.

1. Move the data /opt/examples/log_files/access.log.2 from the local filesystem into HDFS.
$sudo -u hdfs hadoop fs -mkdir \
/user/hive/warehouse/original_access_logs
$sudo -u hdfs hadoop fs -copyFromLocal \
/opt/examples/log_files/access.log.2 \
/user/hive/warehouse/original_access_logs
2. Verify that your data is in HDFS
$hadoop fs -ls \
/user/hive/warehouse/original_access_logs
3. Build a table in Hive and query the data via Apache Impala and Hue. You'll build
this table in 2 steps.
a) Use the Hive Query Editor app in Hue to execute the following queries. Take
advantage of Hive's flexible SerDes (serializers / deserializers) to parse the
logs into individual fields using a regular expression (regex).

CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
    'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs';

(Note: Hive RegexSerDe can be used to extract columns from the input file using
regular expressions. The motivation to create such a SerDe was to process Apache
web logs. There are two classes available:

• org.apache.hadoop.hive.contrib.serde2.RegexSerDe, introduced in Hive 0.4 by HIVE-662, and
• org.apache.hadoop.hive.serde2.RegexSerDe, a built-in class introduced in Hive 0.10 by HIVE-1719.

The former is kept to facilitate easier migration for legacy apps, while the latter is
recommended for new apps.

The SerDe works by matching columns in the table definition with regex groups
defined and captured by the regular expression. A regex group is defined by
parentheses "(...)" inside the regex. The number of columns in the table definition
must match the number of regex groups; otherwise a warning is printed and the
table is not populated. The regex is provided through the required SerDe property
"input.regex". The real power of RegexSerDe is that it can operate not only on
delimiter boundaries, as shown above, but also inside individual columns. Besides
processing web logs and extracting desired fields and patterns from the input file,
another common use case of RegexSerDe is to read files with multi-character field
delimiters, because "FIELDS TERMINATED BY" does not support them.)
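For instance, a minimal sketch of reading a file whose two fields are separated by the multi-character delimiter "||" with the built-in RegexSerDe might look like the following. The table name, column names, and location are illustrative assumptions, and the regex assumes the delimiter appears once per line.

CREATE EXTERNAL TABLE double_pipe_demo (
    first_field STRING,
    second_field STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '(.*)\\|\\|(.*)')
LOCATION '/user/hive/warehouse/double_pipe_demo';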

b) Transfer the data from this intermediate table to one that does not require any
special SerDe. Once the data is in this table, you can query it much faster and
more interactively using Impala. Note that {{lib_dir}} is located at /usr/lib in
the Cloudera CDH VM, so you need to replace {{lib_dir}} in the command below
with /usr/lib.

CREATE EXTERNAL TABLE tokenized_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

ADD JAR {{lib_dir}}/hive/lib/hive-contrib.jar;

INSERT OVERWRITE TABLE tokenized_access_logs
SELECT * FROM intermediate_access_logs;

(Note: Use ADD JAR to add hive-contrib.jar to the Hive session so that classes found in the
jar can be used. The hive-contrib jar provides classes that you might need; for example,
UDFRowSequence is available in hive-contrib.jar. You must add the jar explicitly
before executing a Hive query that uses such a class.)
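As an example, a hedged sketch of using a class from hive-contrib.jar is shown below. UDFRowSequence assigns a running sequence number to each row; the function alias row_seq is an arbitrary name chosen for illustration, and this query is not part of the lab steps.

ADD JAR /usr/lib/hive/lib/hive-contrib.jar;
CREATE TEMPORARY FUNCTION row_seq
AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
-- Number the first few rows of the tokenized log table
SELECT row_seq() AS seq, url FROM tokenized_access_logs LIMIT 10;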

The final query will take a minute to run. It uses a MapReduce job, just like our
Sqoop import did, to transfer the data from one table to the other in parallel.

Check that a new file has been created in the HDFS directory at
/user/hive/warehouse/tokenized_access_logs/000000_0. View the file using the
hadoop fs -cat command. What do you see?
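For example, you can view the first few lines from the terminal (piping through head is optional and simply keeps the output short):

$hadoop fs -cat \
/user/hive/warehouse/tokenized_access_logs/000000_0 | head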



B. Switch to the Impala Query Editor:

1. Tell Impala that some tables have been created through a different tool by entering
the following command in the Impala Query Editor.

invalidate metadata;

2. Refresh the tables list in the left-hand column. You should see two external tables
in the default database.

3. Type the following query in the Query Editor

select count(*), url from tokenized_access_logs
where url like '%\/product\/%'
group by url
order by count(*) desc;

4. By inspecting the results, you will quickly realize that this list contains many of the products
on the most-sold list from the previous tutorial steps, but one product did not show up in the
previous result: there is one product that seems to be viewed a lot, but never purchased. Why?
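To answer this kind of question you would correlate the page-view counts above with the structured sales data imported in the earlier tutorial steps. The sketch below shows roughly what such a "most sold" query could look like; the products and order_items tables and their columns are assumptions based on the earlier Sqoop import and may differ in your environment.

-- Sketch only: assumes products and order_items exist from the earlier Sqoop import
select p.product_name, count(*) as times_sold
from order_items oi
join products p
  on oi.order_item_product_id = p.product_id
group by p.product_name
order by times_sold desc
limit 10;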

Summary

In this lab, you have been introduced to:

1. Uploading bulk data to Hadoop
2. Creating an intermediate table in Hive
3. Parsing logs into individual fields of the intermediate table using a regular expression
4. Transferring the data into a separate table
5. Querying this table via Apache Impala and Hue

It is important to have an efficient and interactive tool to enable analytics on high-volume
semi-structured data. There is a risk of loss if an organization looks for answers within only
partial data. Correlating two data sets for the same business question showed value, and being
able to do so within the same platform made life easier for you and for the organization.

This is the end of the lab



Annex
Sample records from access.log.2 (each record contains the client IP address, timestamp, request method, URL, HTTP version, status code, response size, referrer, and user agent):
79.133.215.123 - - [14/Jun/2014:10:30:13 -0400] GET /home HTTP/1.1 200 1671 - Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36
162.235.161.200 - - [14/Jun/2014:10:30:13 -0400] GET /department/apparel/category/featured%20shops/product/adidas%20Kids'%20RG%20III%20M 200 1175 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.76.4 (KHTML, like Gecko) Version/7.0.4 Safari/537.76.4
39.244.91.133 - - [14/Jun/2014:10:30:14 -0400] GET /department/fitness HTTP/1.1 200 1435 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36
150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] GET /department/fan%20shop/category/water%20sports/product/Pelican%20Sunstream%20100%2 200 1932 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36
217.89.36.129 - - [14/Jun/2014:10:30:14 -0400] GET /view_cart HTTP/1.1 200 1401 - Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0
36.44.59.115 - - [14/Jun/2014:10:30:15 -0400] GET /department/footwear/category/cardio%20equipment HTTP/1.1 200 386 - Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0
11.252.83.179 - - [14/Jun/2014:10:30:15 -0400] GET /view_cart HTTP/1.1 200 1726 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36
56.251.19.230 - - [14/Jun/2014:10:30:15 -0400] GET /department/footwear/category/fitness%20accessories HTTP/1.1 200 2076 - Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36
137.95.229.186 - - [14/Jun/2014:10:30:16 -0400] GET /department/fan%20shop/category/fishing/product/Field%20&%20Stream%20Sportsman%201 200 1413 - Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36
100.98.159.99 - - [14/Jun/2014:10:30:16 -0400] GET /department/fan%20shop/category/water%20sports/product/Pelican%20Sunstream%20100%2 200 396 - Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36

