
06 - Superstore

Pushkar Sarkar

2022-10-16

Let’s Begin!

libraries & data

Let’s bring the tidyverse into RStudio.

library(tidyverse)

Let’s also import the Sample - Superstore dataset and view it.

View(Sample...Superstore)

Cleaning the data

Before we do anything, a few simple quality-of-life changes will make things easier. The column
names use capital letters and full stops. While there’s nothing wrong with that, it’d be nice to have lowercase
column names. There is a neat little package called janitor which can help us.

install.packages("janitor")

We’ll preserve the original dataset as “Sample...Superstore” and create a new data frame called “store” on
which we’ll build every further change and analysis.

library(janitor)

## Warning: package ’janitor’ was built under R version 4.0.5

##
## Attaching package: ’janitor’

## The following objects are masked from ’package:stats’:
##
## chisq.test, fisher.test

store <- Sample...Superstore %>%
  clean_names()

glimpse(store)

## Rows: 9,994
## Columns: 20
## $ order_id <chr> "CA-2016-152156", "CA-2016-152156", "CA-2016-138688", "U~
## $ order_date <chr> "11/8/2016", "11/8/2016", "6/12/2016", "10/11/2015", "10~
## $ ship_date <chr> "11/11/2016", "11/11/2016", "6/16/2016", "10/18/2015", "~
## $ ship_mode <chr> "Second Class", "Second Class", "Second Class", "Standar~
## $ customer_id <chr> "CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-2033~
## $ customer_name <chr> "Claire Gute", "Claire Gute", "Darrin Van Huff", "Sean O~
## $ segment <chr> "Consumer", "Consumer", "Corporate", "Consumer", "Consum~
## $ country <chr> "United States", "United States", "United States", "Unit~
## $ city <chr> "Henderson", "Henderson", "Los Angeles", "Fort Lauderdal~
## $ state <chr> "Kentucky", "Kentucky", "California", "Florida", "Florid~
## $ postal_code <int> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 90032, ~
## $ region <chr> "South", "South", "West", "South", "South", "West", "Wes~
## $ product_id <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240",~
## $ category <chr> "Furniture", "Furniture", "Office Supplies", "Furniture"~
## $ sub_category <chr> "Bookcases", "Chairs", "Labels", "Tables", "Storage", "F~
## $ product_name <chr> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric ~
## $ sales <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600,~
## $ quantity <int> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, 2,~
## $ discount <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, 0.~
## $ profit <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694, 1~
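To see what clean_names() did to the column names, here is a minimal sketch on a tiny data frame. The raw names Order.ID and Sub.Category are an assumption about how the columns looked before cleaning (the original names are not shown above), and raw_demo is a hypothetical object used only for illustration.

```r
library(janitor)

# Hypothetical one-row data frame; the column names mimic the assumed raw import
raw_demo <- data.frame(Order.ID = "CA-2016-152156",
                       Sub.Category = "Bookcases")

# clean_names() converts the names to lowercase snake_case
clean_demo <- clean_names(raw_demo)
names(clean_demo)
```

The values are untouched; only the names change.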

There are many categorical variables, so let’s run some count functions to get a sense of how many products
there are, how many categories etc. Let’s also check the dates to see the time range of the data.

store %>%
  summarise(min = min(order_date),
            max = max(order_date))

## min max
## 1 1/1/2017 9/9/2017

Already we run headfirst into a problem! According to this, we only have 9 months of data from 2017, but
a quick glance at the dataset itself shows us that this isn’t true.
The min() and max() functions are doing their job, just not on the right kind of data: the dates are stored
as text, so they are compared character by character rather than chronologically.
Go back to our glimpse() output. The dates (order_date and ship_date) are “chr”, or characters. But R,
just like Excel, actually has its own data type for dates.
There is a great package called lubridate which allows us to handle dates with ease.
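A small base R sketch of why text dates mislead max(). The four dates are taken from values that appear in this dataset; dates_chr and dates_real are illustrative names. min() goes wrong in the same way.

```r
# Four order dates stored as text, in month/day/year form
dates_chr <- c("11/8/2016", "6/12/2016", "9/9/2017", "12/30/2017")

# Text is compared character by character, so the string starting with "9" wins
max(dates_chr)

# Converted to real dates, the latest date is found chronologically
dates_real <- as.Date(dates_chr, format = "%m/%d/%Y")
max(dates_real)
```

The first call returns "9/9/2017", while the second returns the date 2017-12-30.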

install.packages("lubridate")

library(lubridate)

## Warning: package ’lubridate’ was built under R version 4.0.5

##
## Attaching package: ’lubridate’

## The following objects are masked from ’package:base’:
##
## date, intersect, setdiff, union

store <- store %>%
  mutate(order_date = mdy(order_date),
         ship_date = mdy(ship_date))

glimpse(store)

## Rows: 9,994
## Columns: 20
## $ order_id <chr> "CA-2016-152156", "CA-2016-152156", "CA-2016-138688", "U~
## $ order_date <date> 2016-11-08, 2016-11-08, 2016-06-12, 2015-10-11, 2015-10~
## $ ship_date <date> 2016-11-11, 2016-11-11, 2016-06-16, 2015-10-18, 2015-10~
## $ ship_mode <chr> "Second Class", "Second Class", "Second Class", "Standar~
## $ customer_id <chr> "CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-2033~
## $ customer_name <chr> "Claire Gute", "Claire Gute", "Darrin Van Huff", "Sean O~
## $ segment <chr> "Consumer", "Consumer", "Corporate", "Consumer", "Consum~
## $ country <chr> "United States", "United States", "United States", "Unit~
## $ city <chr> "Henderson", "Henderson", "Los Angeles", "Fort Lauderdal~
## $ state <chr> "Kentucky", "Kentucky", "California", "Florida", "Florid~
## $ postal_code <int> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 90032, ~
## $ region <chr> "South", "South", "West", "South", "South", "West", "Wes~
## $ product_id <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240",~
## $ category <chr> "Furniture", "Furniture", "Office Supplies", "Furniture"~
## $ sub_category <chr> "Bookcases", "Chairs", "Labels", "Tables", "Storage", "F~
## $ product_name <chr> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric ~
## $ sales <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600,~
## $ quantity <int> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, 2,~
## $ discount <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, 0.~
## $ profit <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694, 1~

So we turned order_date into mdy(order_date), and now we have our dates in a date format. But what
is mdy()? It stands for month, day, year. That’s the order in which the date is recorded, so lubridate’s mdy()
function takes that and turns it into a proper date.
Lubridate’s parsers accept a range of separators (01/21/2011 or 01-21-2011, among others), and there are
also ymd, ydm, dmy, dym and myd functions for other orderings.
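As a quick illustration, the sketch below parses the same day written four different ways. The variable names are made up for this example; each parser is named after the order of the components in the text it expects.

```r
library(lubridate)

# Same calendar day, different separators and component orders
from_slash <- mdy("01/21/2011")
from_dash  <- mdy("01-21-2011")
from_dmy   <- dmy("21/01/2011")
from_ymd   <- ymd("2011-01-21")

# All four are Date objects for 21 January 2011
c(from_slash, from_dash, from_dmy, from_ymd)
```

Picking the wrong parser is the usual source of trouble: dmy("01/21/2011") cannot be read as a date because there is no month 21.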

store %>%
  summarise(min = min(order_date),
            max = max(order_date))

## min max
## 1 2014-01-03 2017-12-30

Much better. So we know now that we have 4 years of data.

store %>% count(segment)

## segment n
## 1 Consumer 5191
## 2 Corporate 3020
## 3 Home Office 1783

store %>% count(category)

## category n
## 1 Furniture 2121
## 2 Office Supplies 6026
## 3 Technology 1847

Let’s take a moment to do an interesting little plot.


Let’s try counting the categories within the segments. In other words, how many products of each category
are in each segment?

store %>%
  group_by(segment, category) %>%
  count()

## # A tibble: 9 x 3
## # Groups: segment, category [9]
## segment category n
## <chr> <chr> <int>
## 1 Consumer Furniture 1113
## 2 Consumer Office Supplies 3127
## 3 Consumer Technology 951
## 4 Corporate Furniture 646
## 5 Corporate Office Supplies 1820
## 6 Corporate Technology 554
## 7 Home Office Furniture 362
## 8 Home Office Office Supplies 1079
## 9 Home Office Technology 342

If you think about it, we have what could actually be a table, with the segment as rows and the category as
columns, and n as each cell.
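One way to actually build that table is tidyr’s pivot_wider(). This is a sketch rather than part of the original walkthrough: it retypes the nine counts printed above into a small data frame called counts (an illustrative name) so that it runs on its own.

```r
library(tidyr)

# The nine segment/category counts from the output above
counts <- data.frame(
  segment  = rep(c("Consumer", "Corporate", "Home Office"), each = 3),
  category = rep(c("Furniture", "Office Supplies", "Technology"), times = 3),
  n        = c(1113, 3127, 951, 646, 1820, 554, 362, 1079, 342)
)

# One row per segment, one column per category, n in each cell
wide <- pivot_wider(counts, names_from = category, values_from = n)
wide
```

With the real data, the same pivot_wider() call could be piped straight after count().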

store %>%
  group_by(segment, category) %>%
  count() %>%
  ggplot() + geom_point(aes(category, segment, size = n))

[Figure: point plot of segment (y-axis) against category (x-axis), with point size showing the count n]

By adding a third piece of information (the number of products) into the plot, represented as size, we’ve
taken the traditionally useless scatterplot for categorical variables and made a quick visual that shows us
(intuitively) the spread of products across categories and segments.
Ultimately, though, a bar chart is usually the better plot.

store %>%
  group_by(segment, category) %>%
  count() %>%
  ggplot() + geom_col(aes(category, n, fill = segment))

[Figure: stacked bar chart of n (y-axis) by category (x-axis), filled by segment]

What happens if you use “col =” instead of “fill =”?


We can also do a variation on this:

store %>%
  group_by(segment, category) %>%
  count() %>%
  ggplot() + geom_col(aes(category, n, fill = segment),
                      position = "dodge")

[Figure: dodged (side-by-side) bar chart of n (y-axis) by category (x-axis), filled by segment]

What happens if we use position = “fill” instead?


Let’s go back to the rest of our dataset.

store %>% count(country)

## country n
## 1 United States 9994

So we can drop this column for all further analysis.
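Dropping a column can be done with dplyr’s select() and a minus sign. The sketch below uses a hypothetical two-row stand-in called store_demo (values copied from the glimpse() output earlier) rather than the full dataset.

```r
library(dplyr)

# Hypothetical two-row stand-in for the store data frame
store_demo <- tibble(
  order_id = c("CA-2016-152156", "CA-2016-138688"),
  country  = c("United States", "United States"),
  sales    = c(261.96, 14.62)
)

# select(-country) keeps every column except country
store_demo <- store_demo %>%
  select(-country)

names(store_demo)
```

On the real data, the same select(-country) step could be assigned back to store.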

store %>% count(sub_category)

## sub_category n
## 1 Accessories 775
## 2 Appliances 466
## 3 Art 796
## 4 Binders 1523
## 5 Bookcases 228
## 6 Chairs 617
## 7 Copiers 68
## 8 Envelopes 254
## 9 Fasteners 217
## 10 Furnishings 957
## 11 Labels 364
## 12 Machines 115
## 13 Paper 1370
## 14 Phones 889
## 15 Storage 846
## 16 Supplies 190
## 17 Tables 319

We can arrange this in ascending order.

store %>%
  count(sub_category) %>%
  arrange(n)

## sub_category n
## 1 Copiers 68
## 2 Machines 115
## 3 Supplies 190
## 4 Fasteners 217
## 5 Bookcases 228
## 6 Envelopes 254
## 7 Tables 319
## 8 Labels 364
## 9 Appliances 466
## 10 Chairs 617
## 11 Accessories 775
## 12 Art 796
## 13 Storage 846
## 14 Phones 889
## 15 Furnishings 957
## 16 Paper 1370
## 17 Binders 1523

We used arrange(n) because n is the column name of the count of each thing. Cool trick: we can do
arrange(-n) to flip the order!
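The minus sign only works on numeric columns. dplyr also provides desc(), which does the same job and works on text columns too. A short sketch, using four of the counts above in an illustrative tibble called sub_counts:

```r
library(dplyr)

# Four of the sub-category counts from the output above
sub_counts <- tibble(
  sub_category = c("Copiers", "Machines", "Paper", "Binders"),
  n            = c(68, 115, 1370, 1523)
)

# Same result as arrange(-n): largest count first
by_count <- sub_counts %>% arrange(desc(n))

# desc() also handles text, where -sub_category would throw an error
by_name <- sub_counts %>% arrange(desc(sub_category))

by_count
by_name
```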

store %>%
  count(sub_category) %>%
  arrange(-n)

## sub_category n
## 1 Binders 1523
## 2 Paper 1370
## 3 Furnishings 957
## 4 Phones 889
## 5 Storage 846
## 6 Art 796
## 7 Accessories 775
## 8 Chairs 617
## 9 Appliances 466
## 10 Labels 364
## 11 Tables 319
## 12 Envelopes 254
## 13 Bookcases 228
## 14 Fasteners 217
## 15 Supplies 190
## 16 Machines 115
## 17 Copiers 68

Now let’s dive into some plotting! Let’s try plotting the daily sales over time.

store %>%
  ggplot() +
  geom_line(aes(order_date, sales))

[Figure: line plot of sales (y-axis) against order_date (x-axis), 2014 to 2018, one point per row of the data]

This is incomprehensible. Before you proceed further, try to list out three reasons why! We’ll tackle all three
of them below.
Let’s return to the dataset and see what’s happening with the dates.

view(store)

So immediately, we can see that there are multiple entries for the same date. That means ggplot2 probably isn’t
plotting the daily sales correctly. It might be drawing 4 lines if there are 4 entries for the same date.
We will need to have a single entry with the combined total sales for a particular day.

store %>%
  group_by(order_date) %>%
  summarise(sales = sum(sales))

## # A tibble: 1,237 x 2
## order_date sales
## <date> <dbl>
## 1 2014-01-03 16.4
## 2 2014-01-04 288.
## 3 2014-01-05 19.5
## 4 2014-01-06 4407.
## 5 2014-01-07 87.2
## 6 2014-01-09 40.5
## 7 2014-01-10 54.8
## 8 2014-01-11 9.94
## 9 2014-01-13 3554.
## 10 2014-01-14 62.0
## # ... with 1,227 more rows

This groups the data by order_date (which gives us one row per date) and then calculates the total sales
for each date!
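To check the arithmetic by hand, here is the same pattern on the first five rows shown in the glimpse() output, retyped into a small tibble called first_five (an illustrative name). Two of the rows share 2016-11-08 and two share 2015-10-11, so five rows should collapse to three.

```r
library(dplyr)

# First five rows of the data: order date and sales only
first_five <- tibble(
  order_date = as.Date(c("2016-11-08", "2016-11-08", "2016-06-12",
                         "2015-10-11", "2015-10-11")),
  sales      = c(261.96, 731.94, 14.62, 957.5775, 22.368)
)

# One row per date, with that date's sales added up
daily <- first_five %>%
  group_by(order_date) %>%
  summarise(sales = sum(sales))

daily
```

For 2016-11-08 the total is 261.96 + 731.94 = 993.90, and for 2015-10-11 it is 957.5775 + 22.368 = 979.9455.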

store %>%
  group_by(order_date) %>%
  summarise(sales = sum(sales)) %>%
  ggplot() + geom_line(aes(order_date, sales))

[Figure: line plot of total daily sales (y-axis) against order_date (x-axis), 2014 to 2018]

A slight change, but still not enough. We’ve fixed the first issue though, so let’s move onto the second -
splitting or filtering this by one of its many categorisations.

This data contains sales across segments, categories, sub-categories in various cities and states. It’s actually
a huge mess in that sense - lots of different information is currently being combined into our final sales
number for the day.
Let’s filter our data to include only Central sales, and then go back to our group_by() and add segment, so
that we can then break apart the lines by segment.

store %>%
  filter(region == "Central") %>%
  group_by(order_date, segment) %>%
  summarise(sales = sum(sales))

## ‘summarise()‘ has grouped output by ’order_date’. You can override using the
## ‘.groups‘ argument.

## # A tibble: 950 x 3
## # Groups: order_date [720]
## order_date segment sales
## <date> <chr> <dbl>
## 1 2014-01-03 Consumer 16.4
## 2 2014-01-04 Home Office 288.
## 3 2014-01-07 Consumer 87.2
## 4 2014-01-09 Consumer 40.5
## 5 2014-01-20 Consumer 709.
## 6 2014-01-23 Consumer 5.94
## 7 2014-01-26 Corporate 153.
## 8 2014-01-30 Consumer 240.
## 9 2014-02-01 Consumer 469.
## 10 2014-02-06 Consumer 8.95
## # ... with 940 more rows
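The message about grouped output is informational, not an error. If it gets distracting, one option is to set the .groups argument it mentions. The sketch below does so on a hypothetical five-row stand-in (demo_rows, values copied from the glimpse() output) and is not a change to the analysis itself.

```r
library(dplyr)

# Hypothetical five-row stand-in with two grouping columns
demo_rows <- tibble(
  order_date = as.Date(c("2016-11-08", "2016-11-08", "2016-06-12",
                         "2015-10-11", "2015-10-11")),
  segment    = c("Consumer", "Consumer", "Corporate", "Consumer", "Consumer"),
  sales      = c(261.96, 731.94, 14.62, 957.5775, 22.368)
)

# .groups = "drop" returns an ungrouped tibble and silences the message
by_day_segment <- demo_rows %>%
  group_by(order_date, segment) %>%
  summarise(sales = sum(sales), .groups = "drop")

by_day_segment
```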

Now let’s break apart the lines!

store %>%
  filter(region == "Central") %>%
  group_by(order_date, segment) %>%
  summarise(sales = sum(sales)) %>%
  ggplot() + geom_line(aes(order_date, sales, col = segment))

## ‘summarise()‘ has grouped output by ’order_date’. You can override using the
## ‘.groups‘ argument.

[Figure: line plot of daily Central-region sales (y-axis) against order_date (x-axis), 2014 to 2018, coloured by segment]

So we’ve got what we wanted, but it’s still very difficult to read. If we wanted to filter it further, we could
and probably should, especially if we want to zero in on specific products and regions.
But at any rate, we’ve done something about the second problem.
The third problem is less of a problem and more of something that’d make our lives as analysts easier, which
is that 365 days in a year, i.e. 365 x-axis values per year, is impractical to extract information from visually.
We could try to condense these into monthly aggregates and see what comes out then.
For that, we would need a month column, and for that, we would need to do something to our current
order_date column.
Lubridate has a function called month(), which extracts the month (from 1 to 12, or Jan to Dec).
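A quick sketch of both forms on a single date taken from the dataset. The label argument is part of lubridate’s month(); example_date is just an illustrative name.

```r
library(lubridate)

example_date <- as.Date("2016-11-08")

# Default: the month as a number
month_number <- month(example_date)

# label = TRUE: the month as an ordered factor of abbreviated names
month_label <- month(example_date, label = TRUE)

month_number
month_label
```

Note that both forms discard the year, so November 2014 and November 2017 end up in the same group.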

store %>%
  mutate(month = month(order_date)) %>%
  view()

Now let’s add our filters and everything back.

store %>%
  mutate(month = month(order_date)) %>%
  filter(region == "Central") %>%
  group_by(month, segment) %>%
  summarise(sales = sum(sales))

## ‘summarise()‘ has grouped output by ’month’. You can override using the
## ‘.groups‘ argument.

## # A tibble: 36 x 3
## # Groups: month [12]
## month segment sales
## <dbl> <chr> <dbl>
## 1 1 Consumer 16479.
## 2 1 Corporate 13060.
## 3 1 Home Office 2145.
## 4 2 Consumer 4078.
## 5 2 Corporate 1712.
## 6 2 Home Office 2422.
## 7 3 Consumer 24791.
## 8 3 Corporate 9109.
## 9 3 Home Office 7317.
## 10 4 Consumer 13723.
## # ... with 26 more rows

And finally, let’s plot!

store %>%
  mutate(month = month(order_date)) %>%
  filter(region == "Central") %>%
  group_by(month, segment) %>%
  summarise(sales = sum(sales)) %>%
  ggplot() + geom_line(aes(month, sales, col = segment))

## ‘summarise()‘ has grouped output by ’month’. You can override using the
## ‘.groups‘ argument.

[Figure: line plot of Central-region sales (y-axis, 0 to 40000) against month (x-axis, ticks at 2.5, 5.0, 7.5, 10.0 and 12.5), coloured by segment]

Submission

For submission, upload a Word doc/PDF with the final plot (above) and add your name to the title. Also, try
to answer the following three questions!

1. The x-axis is behaving weirdly. How would you fix it?

2. So consumer sales shoot up in September and December, while corporate sales rise in May and October.
Why?

3. Is this pattern true for all regions?
