
Hive Case Study

Cosmetic Store

Deepak
Creating S3 Bucket:
Created folders in the bucket
Uploaded files to the folders
Copying the dataset into HDFS:

1. Launch an EMR cluster that runs the Hive service:


2. Copy the data from the S3 bucket to the cluster's master node (the local /home/hadoop directory, from which it is later loaded into Hive):

Command used for Nov_2019 file:


aws s3 cp s3://cosmeticsales/Nov_2019/2019-Nov.csv .

Command used for Oct_2019 file:


aws s3 cp s3://cosmeticsales/Oct_2019/2019-Oct.csv .
Checking if files are copied:
Creating the database and launching Hive queries on your EMR cluster:

1. Create and show database:


create database if not exists sales ;

2. Using database:
Use sales ;
Describe db

The database is created under the /user/hive/warehouse location in HDFS.
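The "show" and "describe" steps above appear only as screenshots in the original. A minimal sketch of the statements that would typically produce that output, assuming the database name sales from the CREATE statement:

```sql
-- List all databases to confirm that "sales" exists.
SHOW DATABASES;

-- Show the database's location in HDFS and its owner.
DESCRIBE DATABASE sales;
```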


Creating Table:
CREATE TABLE IF NOT EXISTS cosmetic_sales(
event_time timestamp,
event_type string,
product_id string,
category_id string,
category_code string,
brand string,
price float,
user_id bigint,
user_session string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'timestamp.formats' = "yyyy-MM-dd HH:mm:ss 'UTC'"
);
Table cosmetic_sales created successfully.
To skip the header row containing the column names:
ALTER TABLE cosmetic_sales SET TBLPROPERTIES ("skip.header.line.count"="1");

Describe table
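The describe output is a screenshot in the original. A short sketch of statements that would show the table definition, assuming the table cosmetic_sales created above:

```sql
-- Column names and types only.
DESCRIBE cosmetic_sales;

-- Full metadata: location, SerDe, and table properties such as
-- skip.header.line.count.
DESCRIBE FORMATTED cosmetic_sales;
```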
Load data into table from Nov_2019:
LOAD DATA LOCAL INPATH "/home/hadoop/2019-Nov.csv" INTO TABLE cosmetic_sales ;

Load data into table from Oct_2019:


LOAD DATA LOCAL INPATH "/home/hadoop/2019-Oct.csv" INTO TABLE cosmetic_sales ;
Data loaded successfully:
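One way to confirm the load, sketched here as a suggestion rather than the check the author ran, is to preview a few rows and count the rows per month. If both files were loaded, both month 10 and month 11 should appear in the second result:

```sql
-- Preview a few rows to confirm the columns were parsed as expected.
SELECT * FROM cosmetic_sales LIMIT 5;

-- Row count per month; both October (10) and November (11) should be listed.
SELECT month(event_time) AS event_month, COUNT(*) AS row_count
FROM cosmetic_sales
GROUP BY month(event_time);
```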
Creating Partitioned table for optimization
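The original shows this step only as a screenshot, so the exact definition is not recoverable. The sketch below is an assumption: it uses the table name cosmetic_store that the later "partitioned table" queries refer to, and it partitions by event_type because most queries filter on that column. The author's actual partition column may differ.

```sql
-- Hypothetical definition of the partitioned table (partition column assumed).
-- The partition column is declared in PARTITIONED BY, not in the column list.
CREATE TABLE IF NOT EXISTS cosmetic_store (
  event_time    timestamp,
  product_id    string,
  category_id   string,
  category_code string,
  brand         string,
  price         float,
  user_id       bigint,
  user_session  string
)
PARTITIONED BY (event_type string)
STORED AS TEXTFILE;
```

With this layout, a filter such as event_type = 'purchase' lets Hive read only the matching partition directory instead of scanning the whole table.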
Describe partitioned table
Show partitions created
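A sketch of the statements behind the two steps above, assuming the partitioned table is named cosmetic_store as in the later queries:

```sql
-- Table metadata, including a "Partition Information" section.
DESCRIBE FORMATTED cosmetic_store;

-- One line per partition that currently exists for the table.
SHOW PARTITIONS cosmetic_store;
```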
Insert in partition file
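The insert statement itself is not shown in the original. A minimal sketch using dynamic partitioning, under the assumption that cosmetic_store is partitioned by event_type (a hypothetical choice, since the real partition column is not visible in the source):

```sql
-- Allow Hive to create partitions from the data itself.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition column (event_type) must come last in the SELECT list.
INSERT OVERWRITE TABLE cosmetic_store PARTITION (event_type)
SELECT event_time, product_id, category_id, category_code,
       brand, price, user_id, user_session, event_type
FROM cosmetic_sales;
```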
Queries:
1. Find the total revenue generated due to purchases made in October.  
select sum(price) as Total_Revenue from cosmetic_sales where
event_type="purchase" and month(event_time)=10;
2. Write a query to yield the total sum of purchases per month in a single output.
Select month(event_time) as Purchase_Month, sum(price) as Monthly_purchase from cosmetic_store where event_type='purchase'
group by month(event_time);

• The result above shows that purchases in November 2019 exceeded those in October 2019. This may be due to
festive-season sales such as the Black Friday sale.
3. Write a query to find the change in revenue generated due to purchases from
October to November.
WITH Total_Monthly_Revenue AS (
SELECT
SUM(CASE WHEN date_format(event_time, 'MM')='10' THEN price ELSE 0 END) AS
October_Revenue,
SUM(CASE WHEN date_format(event_time, 'MM')='11' THEN price ELSE 0 END) AS
November_Revenue
FROM cosmetic_sales
WHERE event_type= 'purchase'
AND date_format(event_time, 'MM') in ('10','11')
)
SELECT November_Revenue, October_Revenue, (November_Revenue-October_Revenue) AS 
Difference_Of_Revenue FROM Total_Monthly_Revenue;
• The positive difference returned by the query above shows that purchase
revenue in November exceeded that of October by
319478.469592195 units.
4. Find distinct categories of products. Categories with null category code can be
ignored.
Select distinct(category_code) as Distinct_Categories from cosmetic_sales;

• There are a total of 11 distinct category_code values in the combined data for October and
November 2019.
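The question allows categories with a null category code to be ignored. A hedged variant of the query that filters them out explicitly is sketched below. It assumes missing codes are stored either as NULL or as empty strings, and the resulting count may differ from the figure reported above:

```sql
-- Distinct category codes, skipping NULL and empty values.
SELECT DISTINCT category_code AS Distinct_Categories
FROM cosmetic_sales
WHERE category_code IS NOT NULL
  AND category_code != '';
```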
5. Find the total number of products available under each category:
SELECT category_id, COUNT(distinct(product_id)) AS total_products FROM cosmetic_sales GROUP BY
category_id;

A total of 500 categories are present.


Running with partitioned table:
5. Find the total number of products available under each category:
SELECT category_id, COUNT(distinct(product_id)) AS total_products FROM cosmetic_store
GROUP BY category_id;

A total of 500 categories are present.


6. Which brand had the maximum sales in October and November combined?
SELECT brand, sum(price) as sales from cosmetic_sales where brand != '' group by brand order by
sales desc limit 1 ;

"Strong" is the brand with the maximum sales for both months combined.
Running with partitioned table:
6. SELECT brand, sum(price) as sales from cosmetic_store where brand != '' group by
brand order by sales desc limit 1 ;

Same result as with the non-partitioned table.


7. Which brands increased their sales from October to November? (135)
select brand
from cosmetic_sales
group by brand
having (sum(case when date_format(event_time,'MM')='11' then price else 0 end) >
sum(case when date_format(event_time,'MM')='10' then price else 0 end)
);
A total of 135 brands increased their sales from October to November.
Running with partitioned table:
7. Which brands increased their sales from October to November? (135)
select brand
from cosmetic_store
group by brand
having (sum(case when date_format(event_time,'MM')='11' then price else 0 end) >
sum(case when date_format(event_time,'MM')='10' then price else 0 end)
);
Same result as with the non-partitioned table.
8. Your company wants to reward the top 10 users of its website with a Golden Customer plan.
Write a query to generate a list of top 10 users who spend the most.
Select user_id, sum(price) as total_spent from cosmetic_sales group by user_id order by total_spent desc limit
10;

557616099 is the user who has spent the most
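The query above sums price over every event type. If "spend" is meant to cover completed purchases only, which is an assumption about the intent of the question, a variant restricted to purchase events could look like the sketch below. Its top 10 list may differ from the result reported above:

```sql
-- Top 10 users by amount spent, counting only purchase events.
SELECT user_id, SUM(price) AS total_spent
FROM cosmetic_sales
WHERE event_type = 'purchase'
GROUP BY user_id
ORDER BY total_spent DESC
LIMIT 10;
```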


Running with partitioned table:
8. Your company wants to reward the top 10 users of its website with a Golden Customer plan.
Write a query to generate a list of top 10 users who spend the most.
Select user_id, sum(price) as total_spent from cosmetic_store group by user_id order by total_spent desc limit
10;
Delete tables
Delete database
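The cleanup steps are shown only as screenshots in the original. A minimal sketch, assuming the two table names used in this case study (cosmetic_sales and the partitioned cosmetic_store) and the database sales:

```sql
-- Drop the tables first so the database is empty.
USE sales;
DROP TABLE IF EXISTS cosmetic_sales;
DROP TABLE IF EXISTS cosmetic_store;

-- Switch away from the database before dropping it.
USE default;
DROP DATABASE IF EXISTS sales;
```

Dropping these managed tables also removes their data from the Hive warehouse directory, so this should only be run once the results have been saved.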
Terminate cluster
