
Note: You need CDH for this exercise.

In this use case, we will try to answer another interesting business question: are the most
viewed products also the most sold?

Since Hadoop can store unstructured and semi-structured data alongside structured data without
remodeling an entire database, you can just as well ingest, store, and process web log events. Let's
find out what site visitors have viewed the most.

For this, you need web clickstream data. The most common way to ingest web clickstream data is
to use Apache Flume. Flume is a scalable, real-time ingest framework that allows you to route,
filter, aggregate, and perform "mini-operations" on data on its way into the scalable processing
platform.

But for this use case, we will use sample access log data, which is located at
/opt/examples/log_data/access.log.2.

1. Let's move this data from the local filesystem into HDFS
(/user/hive/warehouse/original_access_logs). (A sample command sequence appears after this list.)
2. Now let's build an intermediate table in Hive. Create an external table named
intermediate_access_logs with the fields (ip STRING, date STRING, method STRING, url STRING,
http_version STRING, code1 STRING, code2 STRING, dash STRING, user_agent STRING). Use
org.apache.hadoop.hive.contrib.serde2.RegexSerDe as the serde, with
([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"
as the input regex, %1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s as the output format
string, and /user/hive/warehouse/original_access_logs as the location. I know there is a lot
going on here, so let me explain. We are creating a Hive table that loads data from an
unstructured log file. How do we parse an unstructured file? With a regex (short for regular
expression). We need to tell Hive that we will parse the file using a regex, and that is where
RegexSerDe and the input regex come into the picture. So why do we need the output format
string? Whenever you use a regex, it tells Hive which capture groups from the regex map to
which columns. Hope this helps :-) If not, watch the video. (A full DDL sketch appears after
this list.)
3. Now let's create another table and load it with the data from the table we created in step 2.
Name it tokenized_access_logs, with the same fields (ip STRING, date STRING, method STRING,
url STRING, http_version STRING, code1 STRING, code2 STRING, dash STRING, user_agent STRING).
If you look at the data in the previous table, the fields are delimited by ",", so declare that
as the field delimiter, and use /user/hive/warehouse/tokenized_access_logs as the location.
Once you have created the table, load the data. (See the sketch after this list.)
4. The final step is to query the table we just created to answer our question. (Hint: all you
need to do is group by url and count the rows, making sure the url contains the word "product".)
Then compare these results with the previous use case, where we queried the most sold products,
and check whether you find anything odd. (A sample query appears after this list.)
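
To make step 1 concrete, here is a minimal sketch using the standard hadoop fs commands. Whether
you need to run them as the HDFS superuser (sudo -u hdfs) depends on how your cluster's
permissions are set up:

    # Create the target directory in HDFS
    sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/original_access_logs
    # Copy the sample log file from the local filesystem into HDFS
    sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/log_data/access.log.2 \
        /user/hive/warehouse/original_access_logs
    # Verify the file landed where Hive expects it
    hadoop fs -ls /user/hive/warehouse/original_access_logs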
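
For step 2, one way to assemble those ingredients into DDL looks like the following; the table
name, columns, serde class, input regex, output format string, and location are all taken from
the step itself. Run it from the Hive shell:

    CREATE EXTERNAL TABLE intermediate_access_logs (
        ip STRING,
        date STRING,
        method STRING,
        url STRING,
        http_version STRING,
        code1 STRING,
        code2 STRING,
        dash STRING,
        user_agent STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
        'output.format.string' = '%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s')
    LOCATION '/user/hive/warehouse/original_access_logs';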
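
For step 3, a sketch of the comma-delimited table plus the load. The INSERT OVERWRITE ... SELECT
statement and the hive-contrib jar path are assumptions based on a typical CDH layout; the jar
is needed because the SELECT reads its rows through the RegexSerDe:

    CREATE EXTERNAL TABLE tokenized_access_logs (
        ip STRING,
        date STRING,
        method STRING,
        url STRING,
        http_version STRING,
        code1 STRING,
        code2 STRING,
        dash STRING,
        user_agent STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/hive/warehouse/tokenized_access_logs';

    -- RegexSerDe ships in hive-contrib; this path is an assumption for CDH
    ADD JAR /usr/lib/hive/lib/hive-contrib.jar;

    -- Parse each raw log line through the regex and write it out comma-delimited
    INSERT OVERWRITE TABLE tokenized_access_logs
    SELECT * FROM intermediate_access_logs;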
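
And for step 4, one possible query; the ORDER BY is an extra touch so that the most viewed
products come out on top:

    -- Count page views per product URL, most viewed first
    SELECT count(*) AS views, url
    FROM tokenized_access_logs
    WHERE url LIKE '%product%'
    GROUP BY url
    ORDER BY views DESC;

Compare the top URLs here against the top sellers from the previous use case; a heavily viewed
product that never shows up among the best sellers is exactly the kind of oddity the question
is after.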
