
World Sales Analysis

Description: Our project, “World Sales Analysis”, examines the sales of
different commodities across the world through online and offline modes
of shopping. We use HDFS, Hive, HBase and Zeppelin to analyse the data.

Step 1: First, we upload the “worldSalesRecord.csv” file to HDFS at the
path /user/maria_dev.

Step 2: Next, we create a Hive external table using the query below.
Before that, it is useful to create a separate database to hold the
tables related to our project.

Query: CREATE DATABASE worldsalesInfo;


Now we move on to creating the external table:
CREATE EXTERNAL TABLE IF NOT EXISTS worldSalesData_external(
Order_Id int, Region string, Country string, Item_Type string,
Sales_channel string, Customer_Id int, Order_Date date, Ship_Date date,
Units_Sold int, Unit_Price decimal, Unit_Cost decimal,
Total_Revenue decimal, Total_Cost decimal, Total_Profit decimal)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/maria_dev';

The database “worldsalesInfo” now contains our table.

Step 3: The next step is to load the data from the file into the table.
We use the following query:
LOAD DATA INPATH '/user/maria_dev/worldSalesData.csv' OVERWRITE INTO
TABLE worldSalesData_external;

The data has been loaded into the table.

Scrolling right shows the remaining columns.
All the records present in the CSV file are now stored in the external
Hive table.
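
A quick way to confirm the load is to count the rows and preview a few of them. The queries below are only a sketch and assume they are run in the database where the external table was created:

```sql
-- Row count; compare it with the number of data rows in the CSV file.
SELECT COUNT(*) FROM worldSalesData_external;

-- Preview the first few records.
SELECT * FROM worldSalesData_external LIMIT 10;
```

If the CSV file starts with a header row, that row is counted as data unless the table is created with the "skip.header.line.count" table property.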

Step 4: We then create an internal table as follows:

The database now contains the internal table too.
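
The original query for this step is not shown here. One possible form is sketched below; the table name worldSalesData_internal, the Order_Id type and the ORC storage format are assumptions, while the remaining columns mirror the external table:

```sql
-- Hypothetical managed (internal) table mirroring the external table's columns.
-- The table name and STORED AS ORC are assumptions, not taken from the report.
CREATE TABLE IF NOT EXISTS worldSalesData_internal (
  Order_Id      int,
  Region        string,
  Country       string,
  Item_Type     string,
  Sales_channel string,
  Customer_Id   int,
  Order_Date    date,
  Ship_Date     date,
  Units_Sold    int,
  Unit_Price    decimal,
  Unit_Cost     decimal,
  Total_Revenue decimal,
  Total_Cost    decimal,
  Total_Profit  decimal
)
STORED AS ORC;
```

Unlike an external table, dropping a managed table also deletes its data files from the Hive warehouse directory.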


Step 5: With that done, we load the new internal table with the data
from the external table as follows:
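
A load of this kind is commonly written as an INSERT ... SELECT. The statement below is a sketch that assumes the internal table is named worldSalesData_internal (a hypothetical name) and has the same column order as the external table:

```sql
-- Copy every row from the external table into the internal table,
-- replacing whatever the internal table held before.
INSERT OVERWRITE TABLE worldSalesData_internal
SELECT * FROM worldSalesData_external;
```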

Viewing data from the internal table:

Scrolling all the way to the right and to the bottom:


Step 6: Now we first create an HBase table and then a Hive table that
maps onto it.

Create the HBase table with one column family:

Next, a table is created in Hive:
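
The commands for this step are not shown here, so the following is only a sketch. The HBase table name worldsales, the column family sales, the Hive table name worldSalesData_hbase and the chosen subset of columns are all assumptions; the storage handler class and the two property names are the ones Hive provides for HBase integration:

```sql
-- HBase shell command (not HiveQL), shown as a comment:
--   create 'worldsales', 'sales'

-- Hive table mapped onto the existing HBase table.
-- ":key" binds Order_Id to the HBase row key; every other column
-- is stored under the single column family "sales".
CREATE EXTERNAL TABLE IF NOT EXISTS worldSalesData_hbase (
  Order_Id      int,
  Region        string,
  Country       string,
  Item_Type     string,
  Sales_channel string,
  Units_Sold    int
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,sales:region,sales:country,sales:item_type,sales:sales_channel,sales:units_sold"
)
TBLPROPERTIES ("hbase.table.name" = "worldsales");
```

The mapping string must list exactly one entry per Hive column, in the same order as the column definitions.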


Now we populate the table in Hive.
The table in HBase has also been populated.
Now, viewing the results for order ID 9:
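
Populating the mapped table and looking up a single order could look like the sketch below. It assumes a Hive table named worldSalesData_hbase (hypothetical) mapped onto the HBase table with Order_Id as the row key, and a source table named worldSalesData_internal (also hypothetical):

```sql
-- Writing through the Hive table stores the rows in HBase.
INSERT INTO TABLE worldSalesData_hbase
SELECT Order_Id, Region, Country, Item_Type, Sales_channel, Units_Sold
FROM worldSalesData_internal;

-- Look up the record whose row key (Order_Id) is 9.
SELECT * FROM worldSalesData_hbase WHERE Order_Id = 9;
```

Because HBase keeps one row per row key, source rows that share an Order_Id would overwrite one another.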

Filtering online sales:


Filtering offline sales:

From the results, it is evident that Amazon has sold more products
through the online mode.
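
The filters and the comparison between the two channels can be expressed as below. This is a sketch: the table queried (worldSalesData_internal) is a hypothetical name, and comparing channels by the sum of Units_Sold is an assumption about how "more products" was measured:

```sql
-- Records for one sales channel at a time.
SELECT * FROM worldSalesData_internal WHERE Sales_channel = 'Online';
SELECT * FROM worldSalesData_internal WHERE Sales_channel = 'Offline';

-- Total units sold per channel, for a direct comparison.
SELECT Sales_channel, SUM(Units_Sold) AS total_units
FROM worldSalesData_internal
GROUP BY Sales_channel;
```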

Creating new notebook


Notebook created
Creating dataframe

Printing schema:
Creating dataframe for customers.csv

Printing schema
Data:

Customers data:
Creating temp view:
Now we can work with views using spark2.sql
Checking views data:
Customers who have placed orders:
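
A semi join is one way to express this. The sketch assumes temp views named sales and customers (hypothetical names) and that customers.csv carries a Customer_Id column matching the one in the sales data. The statement can be passed as a string to spark2.sql or run in a %sql paragraph:

```sql
-- Keep each customer that has at least one matching order;
-- a semi join returns every customer row at most once.
SELECT c.*
FROM customers c
LEFT SEMI JOIN sales s
  ON c.Customer_Id = s.Customer_Id;
```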
Total revenue group by region:
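
As a sketch, assuming the sales data is registered as a temp view named sales (a hypothetical name), the aggregation behind this result could be written as:

```sql
-- Sum the revenue of all orders within each region, largest first.
SELECT Region, SUM(Total_Revenue) AS total_revenue
FROM sales
GROUP BY Region
ORDER BY total_revenue DESC;
```

In Zeppelin, the result of a %sql paragraph can be switched between table and chart views directly.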

Scatter chart:
Item types with unit cost less than 100:
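
One possible query for this filter, assuming a temp view named sales (hypothetical) and a numeric Unit_Cost column:

```sql
-- Distinct item types whose unit cost is below 100, cheapest first.
SELECT DISTINCT Item_Type, Unit_Cost
FROM sales
WHERE Unit_Cost < 100
ORDER BY Unit_Cost;
```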
Number of customers preferring the online and offline sales channels:

Online: 2378

Offline: 2950
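
Counts of this kind come from grouping by sales channel. The sketch assumes a temp view named sales (hypothetical) and counts distinct customers; whether the figures reported here count distinct customers or individual orders is not stated, so this is only one interpretation:

```sql
-- Number of distinct customers who ordered through each channel.
SELECT Sales_channel, COUNT(DISTINCT Customer_Id) AS customers
FROM sales
GROUP BY Sales_channel;
```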
Records of customers who have placed orders:
Customer data by the unit cost of the order placed, filtered by order
date:
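
A join with a date filter could produce such a result. In the sketch below the view names sales and customers, the Customer_Id join column and the cutoff date are all hypothetical, and Order_Date is assumed to have been parsed as a date rather than kept as a string:

```sql
-- Customer details alongside each order placed on or after the cutoff,
-- ordered by the unit cost of the order.
SELECT c.*, s.Order_Id, s.Order_Date, s.Unit_Cost
FROM customers c
JOIN sales s
  ON c.Customer_Id = s.Customer_Id
WHERE s.Order_Date >= CAST('2015-01-01' AS DATE)  -- hypothetical cutoff date
ORDER BY s.Unit_Cost DESC;
```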
