Objective:
In this lab, you will learn to:
- Ingest and store web clickstream data in HDFS.
- Process raw web log events with Hive, using RegexSerDe to extract structured columns.
- Query the processed data interactively with Impala to find the most viewed products.
Scenario:
Produce a report on structured data to answer interesting business questions, such as: are the most viewed products also the most sold? Hadoop can store unstructured and semi-structured data alongside structured data without remodeling an entire database. In this lab, you will learn to ingest, store, and process web log events to find out what site visitors have actually viewed the most.
For this, you need the web clickstream data. The most common way to ingest web clickstream data is to use Apache Flume or Apache Kafka. We have prepared a web clickstream data set for you to bulk upload into HDFS directly using Flume.
For convenience, we have pre-loaded some sample access log data into /opt/examples/log_data/access.log.2. This file can also be found in the PolyMall Lab Sheet menu. A sample of the data in access.log.2 is shown in the Annex.
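For instance, the file can be copied from the local filesystem into HDFS from a terminal. The target directory /user/hive/warehouse/original_access_logs below is an assumption for illustration, not a path mandated by the lab:

hadoop fs -mkdir -p /user/hive/warehouse/original_access_logs
hadoop fs -copyFromLocal /opt/examples/log_data/access.log.2 /user/hive/warehouse/original_access_logs/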
(Note: Hive RegexSerDe can be used to extract columns from the input file using regular expressions. The motivation for creating such a SerDe was to process Apache web logs. There are two classes available: the built-in org.apache.hadoop.hive.serde2.RegexSerDe and org.apache.hadoop.hive.contrib.serde2.RegexSerDe from the hive-contrib library.
The SerDe works by matching columns in the table definition with regex groups defined and captured by the regular expression. A regex group is defined by parentheses "(...)" inside the regex. The number of columns in the table definition must match the number of regex groups; otherwise a warning is printed and the table is not populated. The regex is provided as a required SerDe property called "input.regex". The real power of RegexSerDe is that it can operate not only on delimiter boundaries, as shown in the sketch below, but also inside individual columns. Besides processing web logs and extracting desired fields and patterns from the input file, another common use case of RegexSerDe is to read files with multi-character field delimiters, because "FIELDS TERMINATED BY" does not support them.)
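As a sketch of how the matching works, an intermediate table over the access log might be defined as follows. The table name, column names, and the exact regex are illustrative assumptions, not values mandated by the lab; note that the regex defines exactly seven groups to match the seven columns:

CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    log_date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^ ]*) ([^ ]*) ([^ ]*)" (\\d*) (\\d*)'
)
LOCATION '/user/hive/warehouse/original_access_logs';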
b) Transfer the data from this intermediate table to one that does not require any special SerDe; an illustrative sketch follows the note below. Once the data is in this table, you can query it much faster and more interactively using Impala. Note that {{lib_dir}} is /usr/lib in the Cloudera CDH VM, so you need to replace {{lib_dir}} with /usr/lib in the commands below.
(Note: Use "add jar" to add the hive-contrib jar to Hive so that classes found in the jar can be used. The hive-contrib jar provides classes that you might need; for example, UDFRowSequence is available in hive-contrib.jar. You must add the jar explicitly before executing a Hive query that uses such a class.)
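For example, with {{lib_dir}} already replaced by /usr/lib as noted above, the transfer might look like the following sketch. The target table definition mirrors the illustrative intermediate table from the earlier note, and the table names and location are assumptions:

ADD JAR /usr/lib/hive/lib/hive-contrib.jar;

CREATE EXTERNAL TABLE tokenized_access_logs (
    ip STRING,
    log_date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;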
The final query will take a minute to run. It uses a MapReduce job, just as our Sqoop import did, to transfer the data from one table to the other in parallel.
1. Tell Impala that some tables have been created through a different tool by entering the following command in the Impala Query Editor:
invalidate metadata;
2. Refresh the tables list in the left-hand column. You should see two external tables in the default database.
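For reference, a most-viewed-products list like the one discussed below can be produced with a query along these lines. The table and column names follow the illustrative sketches above, and the '/product/' URL pattern is an assumption about how product pages appear in the logs:

SELECT url, COUNT(*) AS views
FROM tokenized_access_logs
WHERE url LIKE '%/product/%'
GROUP BY url
ORDER BY views DESC;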
1. By inspecting the results, you will quickly realize that this list contains many of the products on the most-sold list from the previous tutorial steps, but there is one product that did not show up in the previous result: a product that seems to be viewed a lot, but never purchased. Why?
7. Summary