
Note: You need CDH for this exercise.

In this use case, we will try to answer another interesting business question: are the most
viewed products also the most sold?

Since Hadoop can store unstructured and semi-structured data alongside structured data without
remodeling an entire database, you can just as well ingest, store, and process web log events. Let's
find out what site visitors have viewed the most.

For this, you need web clickstream data. The most common way to ingest web clickstream data is
to use Apache Flume. Flume is a scalable, real-time ingest framework that allows you to route,
filter, aggregate, and perform "mini-operations" on data on its way into the scalable processing
platform.

But for this use case, we will use sample access log data, which is located at
/opt/examples/log_data/access.log.2.

1. Let's move this data from the local filesystem into HDFS
(/user/hive/warehouse/original_access_logs). (A sample command sequence appears after this list.)
2. Now let's build an intermediate table in Hive. Create an external table named
intermediate_access_logs with the fields (ip STRING, date STRING, method STRING, url STRING,
http_version STRING, code1 STRING, code2 STRING, dash STRING, user_agent STRING). Use
org.apache.hadoop.hive.contrib.serde2.RegexSerDe as the serde, with
([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"
as the input regex, %1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s as the output format
string, and /user/hive/warehouse/original_access_logs as the location. I know there is a lot
going on here, so let me explain. We are creating a Hive table that loads data from an
unstructured log file. How do we parse an unstructured file? With a regex (short for regular
expression). We need to tell Hive that we will parse the file using a regex, and that is where
RegexSerDe and the input regex come into the picture. So why do we need the output format
string? Whenever you use a regex, it tells Hive which capture groups from the regex map to
which columns. Hope this helps :-) If not, watch the video. (A full DDL sketch appears after
this list.)
3. Now let's create another table and load it with the data from the table we created in step 2.
Name it tokenized_access_logs, with the same fields (ip STRING, date STRING, method STRING,
url STRING, http_version STRING, code1 STRING, code2 STRING, dash STRING, user_agent STRING).
If you look at the data in the previous table, the fields are delimited by ",", so declare that
as the field delimiter, and use /user/hive/warehouse/tokenized_access_logs as the location.
Once you have created the table, load the data. (See the sketch after this list.)
4. The final step is to query the table we just created to answer our question. (Hint: all you
need to do is group by url and count the rows, making sure the url contains the word "product".)
Then compare these results with the previous use case, where we queried the most sold products,
and check whether you find anything odd. (A sample query appears after this list.)
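
To make step 1 concrete, here is a minimal sketch using the standard hadoop fs commands. Whether
you need to run them as the HDFS superuser (sudo -u hdfs) depends on how your cluster's
permissions are set up:

    # Create the target directory in HDFS
    sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/original_access_logs
    # Copy the sample log file from the local filesystem into HDFS
    sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/log_data/access.log.2 \
        /user/hive/warehouse/original_access_logs
    # Verify the file landed where Hive expects it
    hadoop fs -ls /user/hive/warehouse/original_access_logs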
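
For step 2, one way to assemble those ingredients into DDL looks like the following; the table
name, columns, serde class, input regex, output format string, and location are all taken from
the step itself. Run it from the Hive shell:

    CREATE EXTERNAL TABLE intermediate_access_logs (
        ip STRING,
        date STRING,
        method STRING,
        url STRING,
        http_version STRING,
        code1 STRING,
        code2 STRING,
        dash STRING,
        user_agent STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
        'output.format.string' = '%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s')
    LOCATION '/user/hive/warehouse/original_access_logs';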
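
For step 3, a sketch of the comma-delimited table plus the load. The INSERT OVERWRITE ... SELECT
statement and the hive-contrib jar path are assumptions based on a typical CDH layout; the jar
is needed because the SELECT reads its rows through the RegexSerDe:

    CREATE EXTERNAL TABLE tokenized_access_logs (
        ip STRING,
        date STRING,
        method STRING,
        url STRING,
        http_version STRING,
        code1 STRING,
        code2 STRING,
        dash STRING,
        user_agent STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/hive/warehouse/tokenized_access_logs';

    -- RegexSerDe ships in hive-contrib; this path is an assumption for CDH
    ADD JAR /usr/lib/hive/lib/hive-contrib.jar;

    -- Parse each raw log line through the regex and write it out comma-delimited
    INSERT OVERWRITE TABLE tokenized_access_logs
    SELECT * FROM intermediate_access_logs;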
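
And for step 4, one possible query; the ORDER BY is an extra touch so that the most viewed
products come out on top:

    -- Count page views per product URL, most viewed first
    SELECT count(*) AS views, url
    FROM tokenized_access_logs
    WHERE url LIKE '%product%'
    GROUP BY url
    ORDER BY views DESC;

Compare the top URLs here against the top sellers from the previous use case; a heavily viewed
product that never shows up among the best sellers is exactly the kind of oddity the question
is after.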
