06 Superstore

Pushkar Sarkar
2022-10-16
Let’s Begin!
libraries & data
Let’s bring the tidyverse into RStudio.
library(tidyverse)
Let’s also import the Sample - Superstore dataset and view it.
View(Sample...Superstore)
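The import step itself is not shown here. As a minimal sketch, assuming the data came from a CSV file read with base R's read.csv() (which would explain the full stops in the column names), the idea looks like this. The two-row file and the names csv_path and sample_store are stand-ins for illustration, not the real dataset.

```r
# Write a two-row stand-in CSV to a temporary file (hypothetical sample,
# using values from the first rows of the Superstore data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Order ID,Order Date,Sub-Category,Sales",
             "CA-2016-152156,11/8/2016,Bookcases,261.96",
             "CA-2016-152156,11/8/2016,Chairs,731.94"),
           csv_path)

# read.csv() turns spaces and hyphens in the headers into full stops.
sample_store <- read.csv(csv_path)
names(sample_store)
```

With the real file, csv_path would point to wherever the Sample - Superstore CSV is saved on your machine.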
Cleaning the data
Before we do anything, some simple quality-of-life changes would be nice. The column
names use capital letters and full stops. While there’s nothing wrong with that, it’d be nice to have lowercase
column names. There is a neat little package called janitor which can help us.
install.packages("janitor")
We’ll preserve the original dataset as “Sample...Superstore” and create a new data frame called “store” on
which we’ll build every further change and analysis.
library(janitor)
## Warning: package ’janitor’ was built under R version 4.0.5
##
## Attaching package: ’janitor’
## The following objects are masked from ’package:stats’:
##
## chisq.test, fisher.test
store <- Sample...Superstore %>%
  clean_names()
glimpse(store)
## Rows: 9,994
## Columns: 20
## $ order_id <chr> "CA-2016-152156", "CA-2016-152156", "CA-2016-138688", "U~
## $ order_date <chr> "11/8/2016", "11/8/2016", "6/12/2016", "10/11/2015", "10~
## $ ship_date <chr> "11/11/2016", "11/11/2016", "6/16/2016", "10/18/2015", "~
## $ ship_mode <chr> "Second Class", "Second Class", "Second Class", "Standar~
## $ customer_id <chr> "CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-2033~
## $ customer_name <chr> "Claire Gute", "Claire Gute", "Darrin Van Huff", "Sean O~
## $ segment <chr> "Consumer", "Consumer", "Corporate", "Consumer", "Consum~
## $ country <chr> "United States", "United States", "United States", "Unit~
## $ city <chr> "Henderson", "Henderson", "Los Angeles", "Fort Lauderdal~
## $ state <chr> "Kentucky", "Kentucky", "California", "Florida", "Florid~
## $ postal_code <int> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 90032, ~
## $ region <chr> "South", "South", "West", "South", "South", "West", "Wes~
## $ product_id <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240",~
## $ category <chr> "Furniture", "Furniture", "Office Supplies", "Furniture"~
## $ sub_category <chr> "Bookcases", "Chairs", "Labels", "Tables", "Storage", "F~
## $ product_name <chr> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric ~
## $ sales <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600,~
## $ quantity <int> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, 2,~
## $ discount <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, 0.~
## $ profit <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694, 1~
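To see what clean_names() does in isolation, here is a small sketch on a toy data frame. The data frame raw and its three columns are made up for illustration, using the kind of names the original dataset has.

```r
library(janitor)

# Toy data frame with capitalised, dotted column names (illustrative only).
raw <- data.frame(Order.ID = "CA-2016-152156",
                  Sub.Category = "Bookcases",
                  Sales = 261.96)

# clean_names() converts the names to lowercase snake_case.
cleaned <- clean_names(raw)
names(cleaned)
```

The values are untouched; only the column names change.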
There are many categorical variables, so let’s run some count functions to get a sense of how many products
there are, how many categories etc. Let’s also check the dates to see the time range of the data.
store %>%
  summarise(min = min(order_date),
            max = max(order_date))
## min max
## 1 1/1/2017 9/9/2017
Already we run headfirst into a problem! According to this, we only have 9 months of data from 2017, but
a quick glance at the dataset itself shows us that this isn’t true.
The min() and max() functions aren’t giving us the earliest and latest dates. This is because the dates
aren’t stored as dates.
Go back to our glimpse() output. The dates (order_date and ship_date) are “chr”, or character, so min() and
max() compare them as text rather than chronologically. But R, just like Excel, actually has its own data
type for dates.
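A small sketch of why text dates mislead min() and max(). The four dates are taken from the outputs in this document; base R's as.Date() is used here only for illustration.

```r
# A few order dates stored as text, as in the raw data.
dates_chr <- c("11/8/2016", "1/1/2017", "12/30/2017", "9/9/2017")

# On character vectors, min() and max() compare the text itself,
# so "9/9/2017" counts as the largest because it starts with "9".
min(dates_chr)
max(dates_chr)

# Converted to real dates, the comparison becomes chronological.
dates_real <- as.Date(dates_chr, format = "%m/%d/%Y")
min(dates_real)
max(dates_real)
```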
There is a great package called lubridate which allows us to handle time series data with ease.
install.packages("lubridate")
library(lubridate)
## Warning: package ’lubridate’ was built under R version 4.0.5
##
## Attaching package: ’lubridate’
## The following objects are masked from ’package:base’:
##
## date, intersect, setdiff, union
store <- store %>%
  mutate(order_date = mdy(order_date),
         ship_date = mdy(ship_date))
glimpse(store)
## Rows: 9,994
## Columns: 20
## $ order_id <chr> "CA-2016-152156", "CA-2016-152156", "CA-2016-138688", "U~
## $ order_date <date> 2016-11-08, 2016-11-08, 2016-06-12, 2015-10-11, 2015-10~
## $ ship_date <date> 2016-11-11, 2016-11-11, 2016-06-16, 2015-10-18, 2015-10~
## $ ship_mode <chr> "Second Class", "Second Class", "Second Class", "Standar~
## $ customer_id <chr> "CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-2033~
## $ customer_name <chr> "Claire Gute", "Claire Gute", "Darrin Van Huff", "Sean O~
## $ segment <chr> "Consumer", "Consumer", "Corporate", "Consumer", "Consum~
## $ country <chr> "United States", "United States", "United States", "Unit~
## $ city <chr> "Henderson", "Henderson", "Los Angeles", "Fort Lauderdal~
## $ state <chr> "Kentucky", "Kentucky", "California", "Florida", "Florid~
## $ postal_code <int> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 90032, ~
## $ region <chr> "South", "South", "West", "South", "South", "West", "Wes~
## $ product_id <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240",~
## $ category <chr> "Furniture", "Furniture", "Office Supplies", "Furniture"~
## $ sub_category <chr> "Bookcases", "Chairs", "Labels", "Tables", "Storage", "F~
## $ product_name <chr> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric ~
## $ sales <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600,~
## $ quantity <int> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, 2,~
## $ discount <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, 0.~
## $ profit <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694, 1~
So we turned order_date into mdy(order_date), and now we have our dates in a date format. But what
is mdy()? It stands for month, day, year. That’s the order the date is recorded in, so lubridate’s mdy()
function takes that and turns it into a proper date.
Lubridate can handle different separators (01/21/2011 or 01-21-2011 or others) and also has ymd, ydm,
dmy, dym and myd for other orderings.
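As a quick sketch of how the parser name has to match the order of the parts, here is the same day written four ways. The variable names d1 to d4 are just for illustration.

```r
library(lubridate)

# The same day written in different orders and with different separators.
d1 <- mdy("11/8/2016")
d2 <- mdy("11-08-2016")
d3 <- dmy("8/11/2016")
d4 <- ymd("2016-11-08")

# All four parse to the same Date value.
c(d1, d2, d3, d4)
```

Picking the wrong parser does not always fail loudly: dmy("11/8/2016") would quietly give the 11th of August, so it is worth checking a few rows after converting.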
store %>%
  summarise(min = min(order_date),
            max = max(order_date))
## min max
## 1 2014-01-03 2017-12-30
Much better. So we know now that we have 4 years of data.
store %>% count(segment)
## segment n
## 1 Consumer 5191
## 2 Corporate 3020
## 3 Home Office 1783
store %>% count(category)
## category n
## 1 Furniture 2121
## 2 Office Supplies 6026
## 3 Technology 1847
Let’s take a moment to do an interesting little plot.

Let’s try counting the categories within the segments. In other words, how many products of each category
are in each segment?
store %>%
  group_by(segment, category) %>%
  count()
## # A tibble: 9 x 3
## # Groups: segment, category [9]
## segment category n
## <chr> <chr> <int>
## 1 Consumer Furniture 1113
## 2 Consumer Office Supplies 3127
## 3 Consumer Technology 951
## 4 Corporate Furniture 646
## 5 Corporate Office Supplies 1820
## 6 Corporate Technology 554
## 7 Home Office Furniture 362
## 8 Home Office Office Supplies 1079
## 9 Home Office Technology 342
If you think about it, we have what could actually be a table, with the segment as rows and the category as
columns, and n as each cell.
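One way to actually build that table is tidyr's pivot_wider(). This is a sketch rather than part of the original workflow: the counts are copied from the output above, and the names seg_cat and wide are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Counts copied from the segment-by-category output above.
seg_cat <- tibble(
  segment  = rep(c("Consumer", "Corporate", "Home Office"), each = 3),
  category = rep(c("Furniture", "Office Supplies", "Technology"), times = 3),
  n        = c(1113, 3127, 951, 646, 1820, 554, 362, 1079, 342)
)

# One row per segment, one column per category, n in each cell.
wide <- seg_cat %>%
  pivot_wider(names_from = category, values_from = n)
wide
```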
store %>%
  group_by(segment, category) %>%
  count() %>%
  ggplot() + geom_point(aes(category, segment, size = n))
[Figure: point plot of segment (y-axis: Consumer, Corporate, Home Office) against category (x-axis: Furniture, Office Supplies, Technology), with point size showing n]
By adding a third piece of information (the number of products) into the plot, represented as size, we’ve
taken the traditionally useless scatterplot for categorical variables and made a quick visual that shows us
(intuitively) the spread of products across categories and segments.
Ultimately, though, a better plot is usually a bar chart.
store %>%
  group_by(segment, category) %>%
  count() %>%
  ggplot() + geom_col(aes(category, n, fill = segment))
[Figure: stacked bar chart of n (y-axis) by category (x-axis), with bars filled by segment: Consumer, Corporate, Home Office]
What happens if you use “col =” instead of “fill =”?

We can also do a variation on this:
store %>%
  group_by(segment, category) %>%
  count() %>%
  ggplot() + geom_col(aes(category, n, fill = segment),
                      position = "dodge")
[Figure: dodged (side-by-side) bar chart of n (y-axis) by category (x-axis), with bars filled by segment: Consumer, Corporate, Home Office]
What happens if we use position = “fill” instead?

Let’s go back to the rest of our dataset.
store %>% count(country)
## country n
## 1 United States 9994
So we can drop this column for all further analysis.
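Dropping a column can be done with select(). The sketch below uses a toy data frame (toy, with values taken from early rows of the data) rather than the full store object, so the names here are illustrative only.

```r
library(dplyr)

# Toy rows standing in for store.
toy <- data.frame(country = c("United States", "United States"),
                  state = c("Kentucky", "California"),
                  sales = c(261.96, 14.62))

# n_distinct() confirms the column holds a single value,
n_distinct(toy$country)

# so dropping it loses no information.
toy_small <- toy %>% select(-country)
names(toy_small)
```

On the real data the equivalent would be store %>% select(-country), assigned back to store if the change should stick.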
store %>% count(sub_category)
## sub_category n
## 1 Accessories 775
## 2 Appliances 466
## 3 Art 796
## 4 Binders 1523
## 5 Bookcases 228
## 6 Chairs 617
## 7 Copiers 68
## 8 Envelopes 254
## 9 Fasteners 217
## 10 Furnishings 957
## 11 Labels 364
## 12 Machines 115
## 13 Paper 1370
## 14 Phones 889
## 15 Storage 846
## 16 Supplies 190
## 17 Tables 319
We can arrange this in ascending order.
store %>%
  count(sub_category) %>%
  arrange(n)
## sub_category n
## 1 Copiers 68
## 2 Machines 115
## 3 Supplies 190
## 4 Fasteners 217
## 5 Bookcases 228
## 6 Envelopes 254
## 7 Tables 319
## 8 Labels 364
## 9 Appliances 466
## 10 Chairs 617
## 11 Accessories 775
## 12 Art 796
## 13 Storage 846
## 14 Phones 889
## 15 Furnishings 957
## 16 Paper 1370
## 17 Binders 1523
We used arrange(n) because n is the column name of the count of each thing. Cool trick: we can do
arrange(-n) to flip the order!
store %>%
  count(sub_category) %>%
  arrange(-n)
## sub_category n
## 1 Binders 1523
## 2 Paper 1370
## 3 Furnishings 957
## 4 Phones 889
## 5 Storage 846
## 6 Art 796
## 7 Accessories 775
## 8 Chairs 617
## 9 Appliances 466
## 10 Labels 364
## 11 Tables 319
## 12 Envelopes 254
## 13 Bookcases 228
## 14 Fasteners 217
## 15 Supplies 190
## 16 Machines 115
## 17 Copiers 68
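The minus sign works because n is numeric. A more general spelling is desc(), sketched here on three of the counts above; the data frame counts is a cut-down stand-in for the full output.

```r
library(dplyr)

# Three of the sub-category counts from the output above.
counts <- data.frame(sub_category = c("Copiers", "Binders", "Paper"),
                     n = c(68, 1523, 1370))

by_minus <- counts %>% arrange(-n)       # works for numeric columns only
by_desc  <- counts %>% arrange(desc(n))  # also works for text and dates

# Both give the same ordering here.
identical(by_minus, by_desc)
```

count() also accepts sort = TRUE, which counts and sorts in descending order in one step.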
Now let’s dive into some plotting! Let’s try plotting the daily sales over time.
store %>%
  ggplot() +
  geom_line(aes(order_date, sales))
[Figure: line plot of sales (y-axis) against order_date (x-axis, 2014 to 2018), one point per row of the data]
This is incomprehensible. Before you proceed further, try to list out three reasons why! We’ll tackle all three
of them below.
Let’s return to the dataset and see what’s happening with the dates.
view(store)
So immediately, we can see that there are multiple entries for the same date. So ggplot2 isn’t
plotting the daily sales correctly. It joins up every row, so a date with 4 entries contributes 4 separate
points instead of one daily total.
We will need to have a single entry with the combined total sales for a particular day.
store %>%
  group_by(order_date) %>%
  summarise(sales = sum(sales))
## # A tibble: 1,237 x 2
## order_date sales
## <date> <dbl>
## 1 2014-01-03 16.4
## 2 2014-01-04 288.
## 3 2014-01-05 19.5
## 4 2014-01-06 4407.
## 5 2014-01-07 87.2
## 6 2014-01-09 40.5
## 7 2014-01-10 54.8
## 8 2014-01-11 9.94
## 9 2014-01-13 3554.
## 10 2014-01-14 62.0
## # ... with 1,227 more rows
This groups the data by order_date (which gives us one row per date) and then calculates the total sales
for each date!
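The same pattern on a toy data frame makes the collapse easy to check by hand. The sales values come from the first rows of the data, but the real days contain more rows than this, so the totals here are illustrative only.

```r
library(dplyr)

# Toy order lines: two on one day, one on another.
toy <- data.frame(
  order_date = as.Date(c("2016-11-08", "2016-11-08", "2016-06-12")),
  sales      = c(261.96, 731.94, 14.62)
)

# One row per date, with that date's sales added up.
daily <- toy %>%
  group_by(order_date) %>%
  summarise(sales = sum(sales))
daily
```

Three rows go in and two come out: the two 2016-11-08 lines are merged into a single total.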
store %>%
  group_by(order_date) %>%
  summarise(sales = sum(sales)) %>%
  ggplot() + geom_line(aes(order_date, sales))
[Figure: line plot of total daily sales (y-axis) against order_date (x-axis, 2014 to 2018)]
A slight change, but still not enough. We’ve fixed the first issue though, so let’s move onto the second -
splitting or filtering this by one of its many categorisations.
This data contains sales across segments, categories, sub-categories in various cities and states. It’s actually
a huge mess in that sense - lots of different information is currently being combined into our final sales
number for the day.
Let’s filter our data to include only Central sales, and then go back to our group_by() and add segment, so
that we can then break apart the lines by segment.
store %>%
  filter(region == "Central") %>%
  group_by(order_date, segment) %>%
  summarise(sales = sum(sales))
## `summarise()` has grouped output by 'order_date'. You can override using the
## `.groups` argument.
## # A tibble: 950 x 3
## # Groups: order_date [720]
## order_date segment sales
## <date> <chr> <dbl>
## 1 2014-01-03 Consumer 16.4
## 2 2014-01-04 Home Office 288.
## 3 2014-01-07 Consumer 87.2
## 4 2014-01-09 Consumer 40.5
## 5 2014-01-20 Consumer 709.
## 6 2014-01-23 Consumer 5.94
## 7 2014-01-26 Corporate 153.
## 8 2014-01-30 Consumer 240.
## 9 2014-02-01 Consumer 469.
## 10 2014-02-06 Consumer 8.95
## # ... with 940 more rows
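The message printed with this output is informational: after summarise(), the result is still grouped by order_date. If an ungrouped result is preferred, the .groups argument can say so. A minimal sketch on made-up toy rows (the sales figures other than 16.4 are invented for illustration):

```r
library(dplyr)

# Toy rows: hypothetical values, not taken from the real data.
toy <- data.frame(
  order_date = as.Date(c("2014-01-03", "2014-01-04", "2014-01-04")),
  segment    = c("Consumer", "Home Office", "Home Office"),
  sales      = c(16.4, 200, 88)
)

# .groups = "drop" returns an ungrouped result and silences the message.
per_day_segment <- toy %>%
  group_by(order_date, segment) %>%
  summarise(sales = sum(sales), .groups = "drop")

is_grouped_df(per_day_segment)
```

Leaving the grouping in place is harmless for the plots in this document, so this is a tidiness choice rather than a fix.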
Now let’s break apart the lines!
store %>%
  filter(region == "Central") %>%
  group_by(order_date, segment) %>%
  summarise(sales = sum(sales)) %>%
  ggplot() + geom_line(aes(order_date, sales, col = segment))
## `summarise()` has grouped output by 'order_date'. You can override using the
## `.groups` argument.
[Figure: line plot of daily sales (y-axis) against order_date (x-axis, 2014 to 2018), one coloured line per segment: Consumer, Corporate, Home Office]
So we’ve got what we wanted, and that’s the second problem dealt with, but the plot is still very difficult
to read. We could, and probably should, filter further if we want to zero in on specific products and
regions.
The third problem is less of a problem and more of something that’d make our lives as analysts easier, which
is that 365 days in a year, i.e. 365 x-axis values per year, is impractical to extract information from visually.
We could try to condense these into monthly aggregates and see what comes out then.
For that, we would need a month column, and for that, we would need to do something to our current
order_date column.
Lubridate has a function called month(), which extracts the month (from 1 to 12, or Jan to Dec).
store %>%
  mutate(month = month(order_date)) %>%
  view()
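A quick sketch of month() on a single date, to show both forms it can return. The variable d is just an example value.

```r
library(lubridate)

d <- ymd("2016-11-08")

month(d)                # numeric month, 11 here
month(d, label = TRUE)  # ordered factor with abbreviated month names
```

Note that month() discards the year, so November 2014 and November 2016 end up in the same group when the data is summarised by month.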
Now let’s add our filters and everything back.
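store %>%
  filter(region == "Central") %>%
  mutate(month = month(order_date)) %>%
  group_by(month, segment) %>%
  summarise(sales = sum(sales))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.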
## # A tibble: 36 x 3
## # Groups: month [12]
## month segment sales
## <dbl> <chr> <dbl>
## 1 1 Consumer 16479.
## 2 1 Corporate 13060.
## 3 1 Home Office 2145.
## 4 2 Consumer 4078.
## 5 2 Corporate 1712.
## 7 3 Consumer 24791.
## 8 3 Corporate 9109.
## 10 4 Consumer 13723.
## # ... with 26 more rows
And finally, let’s plot!
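store %>%
  filter(region == "Central") %>%
  mutate(month = month(order_date)) %>%
  group_by(month, segment) %>%
  summarise(sales = sum(sales)) %>%
  ggplot() + geom_line(aes(month, sales, col = segment))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.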
[Figure: line plot of monthly sales (y-axis, 0 to 40000) against month (x-axis, ticks at 2.5, 5.0, 7.5, 10.0, 12.5), one coloured line per segment: Consumer, Corporate, Home Office]
Submission
For submission, upload a Word document or PDF with the final plot (above) and add your name to the title.
Also, try to answer the three questions below!
1. The x-axis is behaving weirdly. How would you fix it?
2. So consumer sales shoot up in September and December, while corporate sales rise in May and October.
Why?
3. Is this pattern true for all regions?