
CIS-5270 BUSINESS INTELLIGENCE

Superstore Data Analysis

By:

Monika Mishra

Nanjesh Ramesh

CIS 5270: Business Intelligence

Submitted to: Professor Shilpa Balan


Table of Contents

S. No. Topic

1 Introduction and Goal

2 Data Set

1. Data Set URL

2. About the dataset

3. Dataset details

4. Column details

3 Data Cleaning

1. Renaming column

2. Removing unwanted column

3. Duplicating and splitting column

4 Analysis & Visualizations

1. Bar Chart

2. Histogram

3. Pie Chart

4. Tree Map

5. Correlation Matrix

6. Word Cloud

5 Statistical Summary & Functions

1. Statistical Summary

2. User Defined Functions

6 Code Summary


INTRODUCTION AND GOAL

1. Introduction:

The superstore industry consists of companies that operate large retail spaces which store and supply large volumes of goods. These stores sell a typical product line of grocery items and general merchandise, such as food, pharmaceuticals, apparel, games and toys, hobby items, furniture and appliances. Analyzing this industry is valuable because it gives insight into the sales and profits of various products. Our analysis is based on a superstore dataset for the United States, covering products ordered between 2015 and 2018.

2. Goal: To find out various superstore statistics, such as:

• Region that accounts for the greatest number of orders

• Frequency distribution of quantity ordered

• Percentage of sales by category

• Most profitable category and sub-category

• Category and sub-category that incurred losses

• Product types that were ordered most often

• Yearly sales for a given state

With this analysis, the Superstore can identify various aspects of its customers' shopping patterns and take measures where required.


DATA SET

1. Data Set URL:

https://data.world/stanke/sample-superstore-2018

2. About the dataset:

The dataset provides information about the sales and profit of a US superstore from 2015 to 2018.

3. Dataset details:

Size 2.4 MB
Number of columns 21
Number of rows 9994
Original file format XLS

4. Column details:

The dataset contains the following columns-

Column Name Column Detail

Row ID Unique row ID

Order ID Unique Order ID

Order Date Ordered Date of the Order

Ship Date Shipping Date of the Order

Ship Mode Shipping mode of the order


Customer ID Unique ID of Customers

Customer Name Customer’s name

Segment Customer segment

Country US

City City of product ordered

State State of product ordered

Postal Code Postal code for the order

Region Region of product ordered

Product ID Unique Product id

Category Product category

Sub-Category Product sub-category

Product Name Name of the product

Sales Sales contribution of the order

Quantity Quantity ordered

Discount Discount provided on order

Profit Profit for the order


DATA CLEANING

1. Renaming Column

Goal: The column name “CT” was not descriptive. The aim is to rename the column to “City”.

Before

After

Code Used


colnames(superstore)[colnames(superstore)=="CT"] <- "City"

Full Screenshot


2. Removing unwanted Column

Goal: The column named “Country” needs to be removed because it contains only one value, “United States”.

Before

After


Code Used

superstore = subset(superstore, select = -c(Country) )

Full Screenshot


3. Duplicating the column and Splitting it into 3 columns

Goal: To duplicate the column “Order.Date” as “order” and then split “order” into month, day and year.

Before
No column after Profit

After

After duplicating After splitting order column


Code Used

superstore$order<-superstore$Order.Date

library(tidyr)

superstore<-separate(superstore,order,c("month","day","year"),sep="/")
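
As a cross-check on the split above, the same three parts can be derived by parsing the date first. This is a minimal base R sketch, assuming the dates are stored as month/day/two-digit-year strings (an assumption, suggested by the two-digit years in the function output later in this report). The sample values are hypothetical.

```r
# Hypothetical sample values in the assumed "m/d/yy" layout
order_date <- c("11/8/16", "6/12/15")

# Parse into Date objects; strings that do not match the format become NA
parsed <- as.Date(order_date, format = "%m/%d/%y")

# Extract the parts as zero-padded strings
parts <- data.frame(
  month = format(parsed, "%m"),
  day   = format(parsed, "%d"),
  year  = format(parsed, "%Y")
)
print(parts)
```

Compared with a plain string split, this route yields four-digit years and zero-padded months and days, and it flags malformed dates as NA.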

Full Screenshot


ANALYSIS & VISUALIZATIONS

1. What is the total number of orders by region?

Plot Type - Bar Chart

Function Used – barplot, table

Analysis

The above bar chart displays the total number of orders by region. The Western region has the highest order count (more than 3,000), followed by the Eastern region with a count close to 3,000 and the Central region with around 2,300. The Southern region placed the fewest orders (around 1,500).
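
Note that table() counts rows. If a single order can span several rows (an assumption about this dataset, not something stated above), counting distinct Order IDs per region gives a different figure than counting rows. A small sketch with hypothetical data:

```r
# Hypothetical data: order A1 has two rows, C3 has three
toy <- data.frame(
  Order.ID = c("A1", "A1", "B2", "C3", "C3", "C3"),
  Region   = c("West", "West", "East", "West", "West", "West")
)

# Counting rows, as table() on the Region column does
row_counts <- table(toy$Region)

# Counting each Order ID once per region
unique_orders <- unique(toy[, c("Order.ID", "Region")])
order_counts <- table(unique_orders$Region)

print(row_counts)
print(order_counts)
```

Which of the two counts is appropriate depends on whether "orders" means order lines or distinct orders.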


Code Used

> countsR <- table(superstore$Region)

> barplot(countsR, main="Total Orders by Region",

+ xlab="Region", col="lightblue")

Full Screenshot


2. What is the frequency distribution of quantity ordered?

Plot Type - Histogram

Function Used – hist

Analysis

The above histogram shows the frequency distribution of the quantity ordered. The most frequently ordered quantity is 1, with a frequency greater than 3,000, followed by 2, with a frequency close to 2,500. In general, the frequency decreases as the quantity ordered increases. A quantity of 14 has the lowest frequency.
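
When reading per-value frequencies off a histogram of integer data, the bin edges matter. By default hist() uses right-closed bins and includes the lowest break in the first bin, so with integer breaks the two smallest values share one bar. The sketch below uses a hypothetical vector to show the effect, plus two ways to get one count per quantity.

```r
# Hypothetical quantities
qty <- c(1, 1, 2, 2, 2, 3, 3, 4)

# Integer breaks: bins are [1,2], (2,3], (3,4], so 1 and 2 are merged
h_merged <- hist(qty, breaks = 1:4, plot = FALSE)
print(h_merged$counts)

# Half-integer breaks: one bin per integer value
h_exact <- hist(qty, breaks = seq(0.5, 4.5, by = 1), plot = FALSE)
print(h_exact$counts)

# table() gives the same per-value counts without any binning
exact_counts <- table(qty)
print(exact_counts)
```

Checking the histogram against table(superstore$Quantity) is a quick way to confirm which quantities each bar actually covers.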


Code Used

> hist(superstore$Quantity, main="Frequency Distribution of Quantity Ordered",

+ xlab="Quantity Ordered", ylab= "Frequency", col="lightpink")

Full Screenshot


3. What is the percentage sales by category?

Plot Type – Pie Chart

Function Used – pie, group_by, summarize, round, paste

Analysis

The above pie chart shows the percentage of sales by category. There are three categories: Technology, Furniture and Office Supplies. The “Technology” category contributed the most to sales, at 36%, followed by “Furniture” at 32%. “Office Supplies” contributed the least, at 31%.
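
The three percentages above add up to 99 rather than 100, which is what rounding each share to a whole number can produce. A minimal sketch of that effect, using hypothetical sales totals chosen only to reproduce the rounded shares:

```r
# Hypothetical category totals (not the real dataset values)
sales <- c(Furniture = 324, `Office Supplies` = 313, Technology = 363)

# Whole-number percentages, as in the pie chart labels
pct <- round(sales / sum(sales) * 100)
print(pct)
print(sum(pct))

# Keeping one decimal place preserves the total
pct1 <- round(sales / sum(sales) * 100, 1)
print(pct1)
print(sum(pct1))
```

Passing a digits argument to round() in the label code is one way to avoid the apparent gap.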


Code Used

> install.packages("dplyr")

> library("dplyr")

> library(magrittr)

> gd <- superstore %>% group_by(Category) %>% summarize(Sales=sum(Sales))

> pct<-round(gd$Sales/sum(gd$Sales)*100)

> lbls<-paste(gd$Category,pct)

> lbls<-paste(lbls, "%", sep= " ")

> colors = c('lightskyblue','plum2','peachpuff')

> pie(gd$Sales, labels = lbls,main="Percentage Sales By Category",col=colors)

Full Screenshot


4. Which sub-category incurred losses? Which is the most profitable sub-category? How do overall sales compare across categories and sub-categories?

Plot Type – Tree Map

Function Used – list, treemap

Analysis

The above tree map provides information about the sales and profit of the various product categories and sub-categories. The cell size is determined by sales, and the color gradient describes profit. It can be concluded from the map that the sub-category “Phones” under “Technology” has the highest sales. Losses were incurred within the “Furniture” category. The most profitable sub-category is “Copiers”.
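
Readings taken from a color gradient can be confirmed numerically by summing profit per sub-category. The following is a sketch on hypothetical rows; the column names follow the ones used in this report, and the values are invented for illustration.

```r
# Hypothetical rows (values invented for illustration)
toy <- data.frame(
  Sub.Category = c("Tables", "Tables", "Copiers", "Phones"),
  Profit       = c(-50, 10, 200, 80)
)

# Total profit per sub-category
profit_by_sub <- aggregate(Profit ~ Sub.Category, data = toy, FUN = sum)

# Sub-categories with a net loss
losses <- profit_by_sub[profit_by_sub$Profit < 0, ]

# Sub-category with the highest total profit
best <- as.character(profit_by_sub$Sub.Category[which.max(profit_by_sub$Profit)])

print(losses)
print(best)
```

Running the same aggregation on the superstore table would list the exact loss-making sub-categories instead of relying on the map's colors.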


Code Used

> install.packages("treemap")

> library(treemap)

> treemap(superstore, index = c("Category","Sub.Category"), vSize = "Sales",

+ vColor = "Profit", type = "value", palette = "RdYlGn",

+ range = c(-20000,60000), mapping = c(-20000,10000,60000),

+ title = "Sales Treemap For categories", fontsize.labels = c(15,10),

+ align.labels = list(c("centre","centre"),c("left","top")))

Full Screenshot


5. What is the correlation between Sales, Quantity, Discount and Profit?

Plot Type – Correlation Matrix

Function Used – corrplot, cor

Analysis

This correlation matrix chart shows how the variables Sales, Quantity, Discount and Profit relate to one another. The color gradient from red to blue describes the direction and strength of each correlation, with red indicating a negative correlation and blue a positive one. It can be seen that “Sales” and “Profit” are somewhat correlated, “Profit” and “Quantity” are only very weakly correlated, and “Profit” and “Discount” are negatively correlated.
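
To make the reading of the matrix concrete, the sketch below computes a correlation matrix on a tiny hypothetical data frame. Selecting columns by name, as done here, avoids depending on column positions, which shift when columns are removed or added during cleaning.

```r
# Hypothetical values, chosen only to illustrate the calculation
toy <- data.frame(
  Sales    = c(100, 200, 300, 400),
  Discount = c(0.0, 0.2, 0.1, 0.4),
  Profit   = c(20, 10, 45, -5)
)

# Pearson correlation between every pair of columns
m <- cor(toy[, c("Sales", "Discount", "Profit")])
print(round(m, 2))

# Each variable correlates perfectly with itself, and the matrix is symmetric
print(diag(m))
print(isSymmetric(m))
```

In this toy data the Discount and Profit entry is negative, which is the kind of cell that would be drawn in red.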


Code Used

> install.packages("corrplot")

> mydata <- superstore[, c("Sales","Quantity","Discount","Profit")]

> View(mydata)

> library(corrplot)

> mydata.cor = cor(mydata)

> mydata.cor

> corrplot(mydata.cor)

Full Screenshot


6. Which product types have been ordered most often?

Plot Type – Word Cloud

Function Used – wordcloud

Analysis

Word clouds (also known as text clouds or tag clouds) work in a simple way: the more often a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud. In our case we want to know what kinds of products have been ordered frequently. Looking at the above word cloud, it is clear that products related to “Xerox” have been ordered the most. Products related to binders, chairs and Avery have also been ordered many times.
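
The sizes in a word cloud come from word frequencies, which can also be inspected directly. This base R sketch uses hypothetical product names and a simple split on non-letter characters; it is an illustration of the counting idea, not the exact preprocessing the wordcloud package applies.

```r
# Hypothetical product names
product_names <- c("Xerox 1967", "Avery Binder Labels",
                   "Xerox 225", "Economy Binders")

# Split on anything that is not a letter, lowercase, and drop empty pieces
words <- tolower(unlist(strsplit(product_names, "[^A-Za-z]+")))
words <- words[nchar(words) > 0]

# Count occurrences and sort with the most frequent word first
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

Note that "binder" and "binders" are counted separately here; stemming would be needed to merge them.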


Code Used

> install.packages("tm")

> install.packages("SnowballC")

> install.packages("wordcloud")

> install.packages("RColorBrewer")

> library(tm)

> library(SnowballC)

> library(RColorBrewer)

> library(wordcloud)

> wordcloud(words = superstore$Product.Name, min.freq = 1,


+ max.words=100, random.order=FALSE, rot.per=0.35,
+ colors=brewer.pal(8, "Dark2"))

Full Screenshot


STATISTICAL SUMMARY & FUNCTIONS

1. Statistical Summary

Question - Provide a statistical summary of the Sales.

Answer – Given below is the statistical summary of the Sales:

Statistics    Value    Meaning

Min. (Minimum)    0.444    The lowest value of the sales present in the table.

1st Qu. (First Quartile)    17.280    The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set. It splits off the lowest 25% of data from the highest 75%.

Median    54.490    It represents the middle number in a given sequence of numbers when it’s ordered by rank.

Mean    229.858    It is the average of the Sales. It is the sum of all Sales values divided by the total number of Sales values.

3rd Qu. (Third Quartile)    209.940    The third quartile (Q3) is defined as the middle number between the median and the highest value of the data set. It splits off the highest 25% of data from the lowest 75%.

Max. (Maximum)    22638.480    The highest value of the sales present in the table.
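
The definitions in the table can be checked on a small example. The sketch below uses a hypothetical vector of seven values and R's default quantile method; other quantile methods can give slightly different quartiles on small samples.

```r
# Hypothetical sales values, already sorted
sales <- c(2, 4, 6, 8, 10, 12, 14)

# Six-number summary, the same call used on the Sales column
print(summary(sales))

# Quartiles on their own (R's default method, type 7)
q <- quantile(sales, probs = c(0.25, 0.5, 0.75))
print(q)

# The mean is the sum of the values divided by how many there are
avg <- sum(sales) / length(sales)
print(avg)
```

In the Sales summary above, the mean (229.858) lies well above the median (54.490), which suggests that a relatively small number of large orders pull the average up.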


Code Used for Execution

> setwd("~/Desktop/BI")

> superstore<-read.csv("superstore.csv")

> View(superstore)

> summary(superstore$Sales)

Result

Full Screenshot


2. User Defined Function

Question – What are the total sales for each year for a particular user-provided state?

Answer – To answer the above question, we created a user defined function which takes a state name as its input parameter and displays the total sales by year for that state by plotting a line graph.

The state name provided by the user is validated by checking whether it is present in the superstore table. If it is not present, an error message is shown. If it is present, the line chart is plotted to display the result.

Full Screenshot


Code Screenshot


Execution Screenshot

Line Chart Screenshot


Function Code

# Function returns total sales by year for the entered state.
# It reads the global data frame "superstore", which must already contain
# the "year" column created in the data cleaning step.

statesales<-function(inputstate)
{
  # importing libraries
  library(tidyr)
  library(dplyr)
  library(ggplot2)

  print(paste("The State provided by the user is: ", inputstate))

  # retrieving distinct state names from the table
  state_name<-distinct(superstore, State)

  # checking if the state provided is correct or not
  isvalid<- any(state_name == inputstate)

  # if the state name provided is valid, a graph will be plotted
  if (isvalid)
  {
    selected<-select(superstore, State, Sales, year)
    filtered<- filter(selected,State==inputstate)
    aggregated<-aggregate(filtered$Sales,by=list(filtered$year),sum)
    print(aggregated)

    # plotting line chart; each "+" ends its line so that R keeps
    # reading the layers as one expression
    ggplot(data=aggregated, aes(x=Group.1, y=x, group=1)) +
      geom_line(color="red") +
      geom_point(color="blue") + xlab("Year") + ylab("Total Sales") +
      ggtitle("Total Sales by year")
  }
  else
  { print('Enter correct state name') }
}
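
The validation and aggregation steps of the function can be tried in isolation without dplyr or ggplot2. The following base R sketch uses a hypothetical three-column table; the helper name yearly_sales is an assumption introduced for illustration and is not part of the original code.

```r
# Hypothetical table with the three columns the function relies on
toy <- data.frame(
  State = c("California", "California", "Texas", "California"),
  Sales = c(100, 50, 70, 30),
  year  = c("15", "16", "15", "15"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns yearly totals, or NULL for an unknown state
yearly_sales <- function(df, inputstate) {
  if (!(inputstate %in% df$State)) {
    return(NULL)
  }
  filtered <- df[df$State == inputstate, ]
  # Same aggregation call as in the function above: sum Sales per year
  aggregate(filtered$Sales, by = list(filtered$year), sum)
}

agg <- yearly_sales(toy, "California")
print(agg)

unknown <- yearly_sales(toy, "LA")
print(is.null(unknown))
```

Returning the aggregated table rather than only printing it makes the result easy to test or reuse before any plotting is added.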


Execution Script

> setwd("~/Desktop/BI")

> source("sales.R")

> statesales("LA")

[1] "The State provided by the user is: LA"

[1] "Enter correct state name"

> statesales("California")

[1] "The State provided by the user is: California"

Group.1 x

1 15 91303.53

2 16 88443.84

3 17 131551.91

4 18 146388.34


CODE SUMMARY

1. Data Cleaning Codes

a. Renaming Column

colnames(superstore)[colnames(superstore)=="CT"] <- "City"

b. Removing unwanted Column

superstore = subset(superstore, select = -c(Country) )

c. Duplicating the column and splitting into 3 columns

superstore$order<-superstore$Order.Date

library(tidyr)

superstore<-separate(superstore,order,c("month","day","year"),sep="/")


2. Visualization Codes

a. Bar Chart

> countsR <- table(superstore$Region)

> barplot(countsR, main="Total Orders by Region",

+ xlab="Region", col="lightblue")

b. Histogram

> hist(superstore$Quantity, main="Frequency Distribution of Quantity Ordered",

+ xlab="Quantity Ordered", ylab= "Frequency", col="lightpink")

c. Pie Chart

> install.packages("dplyr")

> library("dplyr")

> library(magrittr)

> gd <- superstore %>% group_by(Category) %>% summarize(Sales=sum(Sales))

> pct<-round(gd$Sales/sum(gd$Sales)*100)

> lbls<-paste(gd$Category,pct)

> lbls<-paste(lbls, "%", sep= " ")

> colors = c('lightskyblue','plum2','peachpuff')

> pie(gd$Sales, labels = lbls,main="Percentage Sales By Category",col=colors)


d. Tree Map

> install.packages("treemap")

> library(treemap)

> treemap(superstore, index = c("Category","Sub.Category"), vSize = "Sales",

+ vColor = "Profit", type = "value", palette = "RdYlGn",

+ range = c(-20000,60000), mapping = c(-20000,10000,60000),

+ title = "Sales Treemap For categories", fontsize.labels = c(15,10),

+ align.labels = list(c("centre","centre"),c("left","top")))

e. Correlation Matrix

> install.packages("corrplot")

> mydata <- superstore[, c("Sales","Quantity","Discount","Profit")]

> View(mydata)

> library(corrplot)

> mydata.cor = cor(mydata)

> mydata.cor

> corrplot(mydata.cor)


f. Word Cloud

> install.packages("tm")

> install.packages("SnowballC")

> install.packages("wordcloud")

> install.packages("RColorBrewer")

> library(tm)

> library(SnowballC)

> library(RColorBrewer)

> library(wordcloud)

> wordcloud(words = superstore$Product.Name, min.freq = 1,

+ max.words=100, random.order=FALSE, rot.per=0.35,

+ colors=brewer.pal(8, "Dark2"))

3. Statistics Summary Code

> setwd("~/Desktop/BI")

> superstore<-read.csv("superstore.csv")

> View(superstore)

> summary(superstore$Sales)


4. User Defined Function Code

# Function returns total sales by year for the entered state

statesales<-function(inputstate)

{
# importing libraries

library(tidyr)
library(dplyr)
library(ggplot2)

print(paste("The State provided by the user is: ", inputstate))

# retrieving distinct state name from the table

state_name<-distinct(superstore, State)

# checking if the state provided is correct or not

isvalid<- any(state_name == inputstate)

# if the state name provided is valid, a graph will be plotted

if (isvalid==TRUE)

{
selected<-select(superstore, State, Sales, year)
filtered<- filter(selected,State==inputstate)
aggregated<-aggregate(filtered$Sales,by=list(filtered$year),sum)
print(aggregated)

# plotting line chart

ggplot(data=aggregated, aes(x=Group.1, y=x, group=1)) + geom_line(color="red") +
geom_point(color="blue")+xlab("Year") + ylab("Total Sales") +
ggtitle("Total Sales by year")
}
else
{ print('Enter correct state name') }
}
