06 Superstore
Pushkar Sarkar
2022-10-16
Let’s Begin!
library(tidyverse)
Let’s also import the Sample - Superstore dataset and view it.
View(Sample...Superstore)
Before we do anything, some simple quality-of-life changes would be nice. The column
names use capital letters and full stops. While there’s nothing wrong with that, it’d be nice to have lowercase
column names. There is a neat little package called janitor which can help us.
install.packages("janitor")
We’ll preserve the original dataset as “Sample...Superstore” and create a new data frame called “store” on
which we’ll build every further change and analysis.
library(janitor)
##
## Attaching package: ’janitor’
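The step that creates store is not shown here. A minimal sketch of how it might look, assuming janitor's clean_names() is the function used; the small data frame raw is a hypothetical stand-in for Sample...Superstore:

```r
library(janitor)

# Hypothetical stand-in for Sample...Superstore, with the same style of column names
raw <- data.frame(
  Order.ID     = c("CA-2016-152156", "CA-2016-138688"),
  Sub.Category = c("Bookcases", "Labels"),
  Sales        = c(261.96, 14.62)
)

# clean_names() converts the column names to lowercase snake_case
store_demo <- clean_names(raw)
names(store_demo)
```

On the real data, the equivalent call would presumably be store <- clean_names(Sample...Superstore).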
glimpse(store)
## Rows: 9,994
## Columns: 20
## $ order_id <chr> "CA-2016-152156", "CA-2016-152156", "CA-2016-138688", "U~
## $ order_date <chr> "11/8/2016", "11/8/2016", "6/12/2016", "10/11/2015", "10~
## $ ship_date <chr> "11/11/2016", "11/11/2016", "6/16/2016", "10/18/2015", "~
## $ ship_mode <chr> "Second Class", "Second Class", "Second Class", "Standar~
## $ customer_id <chr> "CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-2033~
## $ customer_name <chr> "Claire Gute", "Claire Gute", "Darrin Van Huff", "Sean O~
## $ segment <chr> "Consumer", "Consumer", "Corporate", "Consumer", "Consum~
## $ country <chr> "United States", "United States", "United States", "Unit~
## $ city <chr> "Henderson", "Henderson", "Los Angeles", "Fort Lauderdal~
## $ state <chr> "Kentucky", "Kentucky", "California", "Florida", "Florid~
## $ postal_code <int> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 90032, ~
## $ region <chr> "South", "South", "West", "South", "South", "West", "Wes~
## $ product_id <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240",~
## $ category <chr> "Furniture", "Furniture", "Office Supplies", "Furniture"~
## $ sub_category <chr> "Bookcases", "Chairs", "Labels", "Tables", "Storage", "F~
## $ product_name <chr> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric ~
## $ sales <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600,~
## $ quantity <int> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, 2,~
## $ discount <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, 0.~
## $ profit <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694, 1~
There are many categorical variables, so let’s run some count functions to get a sense of how many products
there are, how many categories etc. Let’s also check the dates to see the time range of the data.
store %>%
summarise(min = min(order_date),
max = max(order_date))
## min max
## 1 1/1/2017 9/9/2017
Already we run headfirst into a problem! According to this, we only have 9 months of data from 2017, but
a quick glance at the dataset itself shows us that this isn’t true.
The min() and max() functions are working, just not on dates. This is because the data itself isn’t
properly formatted: the dates are being compared as text, character by character, rather than chronologically.
Go back to our glimpse() output. The dates (order_date and ship_date) are “chr”, or character, columns. But R,
just like Excel, actually has its own data type for dates.
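To see why text dates mislead min() and max(), here is a small base R sketch. The four dates are made up for illustration, and as.Date() is used only to keep the example free of extra packages:

```r
# A few dates stored as text, in the same month/day/year style as order_date
dates_chr <- c("11/8/2016", "9/9/2017", "12/30/2017", "10/11/2015")

# Text is compared character by character, so the string starting with "9"
# beats every string starting with "1"
max(dates_chr)

# Parsed into real dates, the comparison becomes chronological
dates_parsed <- as.Date(dates_chr, format = "%m/%d/%Y")
min(dates_parsed)
max(dates_parsed)
```

The text maximum is "9/9/2017" even though 30 December 2017 is later, which is the same kind of wrong answer we got from summarise().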
There is a great package called lubridate which allows us to handle time series data with ease.
install.packages("lubridate")
library(lubridate)
##
## Attaching package: ’lubridate’
## The following objects are masked from ’package:base’:
##
## date, intersect, setdiff, union
glimpse(store)
## Rows: 9,994
## Columns: 20
## $ order_id <chr> "CA-2016-152156", "CA-2016-152156", "CA-2016-138688", "U~
## $ order_date <date> 2016-11-08, 2016-11-08, 2016-06-12, 2015-10-11, 2015-10~
## $ ship_date <date> 2016-11-11, 2016-11-11, 2016-06-16, 2015-10-18, 2015-10~
## $ ship_mode <chr> "Second Class", "Second Class", "Second Class", "Standar~
## $ customer_id <chr> "CG-12520", "CG-12520", "DV-13045", "SO-20335", "SO-2033~
## $ customer_name <chr> "Claire Gute", "Claire Gute", "Darrin Van Huff", "Sean O~
## $ segment <chr> "Consumer", "Consumer", "Corporate", "Consumer", "Consum~
## $ country <chr> "United States", "United States", "United States", "Unit~
## $ city <chr> "Henderson", "Henderson", "Los Angeles", "Fort Lauderdal~
## $ state <chr> "Kentucky", "Kentucky", "California", "Florida", "Florid~
## $ postal_code <int> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 90032, ~
## $ region <chr> "South", "South", "West", "South", "South", "West", "Wes~
## $ product_id <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240",~
## $ category <chr> "Furniture", "Furniture", "Office Supplies", "Furniture"~
## $ sub_category <chr> "Bookcases", "Chairs", "Labels", "Tables", "Storage", "F~
## $ product_name <chr> "Bush Somerset Collection Bookcase", "Hon Deluxe Fabric ~
## $ sales <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48.8600,~
## $ quantity <int> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5, 3, 6, 2, 2,~
## $ discount <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0.20, 0.20, 0.~
## $ profit <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.5164, 14.1694, 1~
So we turned order_date into mdy(order_date) (and likewise ship_date), and now we have our dates in a date format. But what
is mdy()? It stands for month, day, year. That’s the order the date is recorded in, so lubridate’s mdy()
function takes that and turns it into a proper date.
Lubridate can handle dates with different separators (01/21/2011 or 01-21-2011 or others) and also has ymd(), ydm(),
dmy(), dym() and myd() for other orderings.
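The conversion itself is not shown in the text. A minimal sketch, assuming it was done with mutate() and mdy(); store_demo is a hypothetical two-row stand-in for store:

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for store, with dates still stored as text
store_demo <- data.frame(
  order_date = c("11/8/2016", "10/11/2015"),
  ship_date  = c("11/11/2016", "10/18/2015")
)

# mdy() parses month/day/year text into Date values
store_demo <- store_demo %>%
  mutate(order_date = mdy(order_date),
         ship_date  = mdy(ship_date))

class(store_demo$order_date)
```

The result is assigned back to the data frame so that later steps see the date columns rather than the original text.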
store %>%
summarise(min = min(order_date),
max = max(order_date))
## min max
## 1 2014-01-03 2017-12-30
## segment n
## 1 Consumer 5191
## 2 Corporate 3020
## 3 Home Office 1783
## category n
## 1 Furniture 2121
## 2 Office Supplies 6026
## 3 Technology 1847
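The calls that produced the two tables above are not shown. Their shape matches what dplyr's count() returns, so a plausible sketch (an assumption, run here on a tiny hypothetical stand-in for store) is:

```r
library(dplyr)

# Hypothetical stand-in for store
store_demo <- data.frame(
  segment  = c("Consumer", "Consumer", "Corporate", "Home Office"),
  category = c("Furniture", "Technology", "Furniture", "Office Supplies")
)

# count() returns one row per distinct value, with the tally in a column called n
segment_counts  <- store_demo %>% count(segment)
category_counts <- store_demo %>% count(category)
segment_counts
```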
store %>%
group_by(segment, category) %>%
count()
## # A tibble: 9 x 3
## # Groups: segment, category [9]
## segment category n
## <chr> <chr> <int>
## 1 Consumer Furniture 1113
## 2 Consumer Office Supplies 3127
## 3 Consumer Technology 951
## 4 Corporate Furniture 646
## 5 Corporate Office Supplies 1820
## 6 Corporate Technology 554
## 7 Home Office Furniture 362
## 8 Home Office Office Supplies 1079
## 9 Home Office Technology 342
If you think about it, we have what could actually be a table, with the segment as rows and the category as
columns, and n as each cell.
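If you want to see that table literally, one option is tidyr's pivot_wider(). This is a side sketch rather than a step in the walkthrough; the counts are copied from the output above into a small data frame so the example runs on its own:

```r
library(dplyr)
library(tidyr)

# Counts copied from the grouped output above
counts <- data.frame(
  segment  = rep(c("Consumer", "Corporate", "Home Office"), each = 3),
  category = rep(c("Furniture", "Office Supplies", "Technology"), times = 3),
  n        = c(1113, 3127, 951, 646, 1820, 554, 362, 1079, 342)
)

# One row per segment, one column per category, n in each cell
wide <- counts %>%
  pivot_wider(names_from = category, values_from = n)
wide
```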
store %>%
group_by(segment, category) %>%
count() %>%
ggplot() + geom_point(aes(category, segment, size = n))
[Figure: point plot with category on the x-axis and segment on the y-axis (Consumer, Corporate, Home Office); point size shows n (legend: 1000, 2000, 3000)]
By adding a third piece of information (the number of rows, n) into the plot, represented as size, we’ve
taken the scatterplot, which is traditionally useless for categorical variables, and made a quick visual that shows us
(intuitively) the spread of order lines across categories and segments.
Ultimately, though, a better plot is usually a bar chart.
store %>%
group_by(segment, category) %>%
count() %>%
ggplot() + geom_col(aes(category, n, fill = segment))
[Figure: stacked bar chart of n by category, with bars filled by segment (Consumer, Corporate, Home Office)]
store %>%
group_by(segment, category) %>%
count() %>%
ggplot() + geom_col(aes(category, n, fill = segment),
position = "dodge")
[Figure: dodged (side-by-side) bar chart of n by category, with bars filled by segment (Consumer, Corporate, Home Office)]
## country n
## 1 United States 9994
## sub_category n
## 1 Accessories 775
## 2 Appliances 466
## 3 Art 796
## 4 Binders 1523
## 5 Bookcases 228
## 6 Chairs 617
## 7 Copiers 68
## 8 Envelopes 254
## 9 Fasteners 217
## 10 Furnishings 957
## 11 Labels 364
## 12 Machines 115
## 13 Paper 1370
## 14 Phones 889
## 15 Storage 846
## 16 Supplies 190
## 17 Tables 319
store %>%
count(sub_category) %>%
arrange(n)
## sub_category n
## 1 Copiers 68
## 2 Machines 115
## 3 Supplies 190
## 4 Fasteners 217
## 5 Bookcases 228
## 6 Envelopes 254
## 7 Tables 319
## 8 Labels 364
## 9 Appliances 466
## 10 Chairs 617
## 11 Accessories 775
## 12 Art 796
## 13 Storage 846
## 14 Phones 889
## 15 Furnishings 957
## 16 Paper 1370
## 17 Binders 1523
We used arrange(n) because n is the name of the column that count() creates to hold the number of rows in
each group. Cool trick: we can do arrange(-n) to flip the order!
store %>%
count(sub_category) %>%
arrange(-n)
## sub_category n
## 1 Binders 1523
## 2 Paper 1370
## 3 Furnishings 957
## 4 Phones 889
## 5 Storage 846
## 6 Art 796
## 7 Accessories 775
## 8 Chairs 617
## 9 Appliances 466
## 10 Labels 364
## 11 Tables 319
## 12 Envelopes 254
## 13 Bookcases 228
## 14 Fasteners 217
## 15 Supplies 190
## 16 Machines 115
## 17 Copiers 68
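The minus sign only works on numeric columns. Two other spellings of the same descending sort are sketched below on a small hypothetical data frame: desc(), which also works on text columns, and the sort argument of count():

```r
library(dplyr)

# Hypothetical stand-in for store
store_demo <- data.frame(
  sub_category = c("Binders", "Paper", "Binders", "Copiers", "Paper", "Binders")
)

# Three ways to get the largest counts first
by_minus <- store_demo %>% count(sub_category) %>% arrange(-n)
by_desc  <- store_demo %>% count(sub_category) %>% arrange(desc(n))
by_sort  <- store_demo %>% count(sub_category, sort = TRUE)
by_sort
```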
Now let’s dive into some plotting! Let’s try plotting the daily sales over time.
store %>%
ggplot() +
geom_line(aes(order_date, sales))
[Figure: line plot of sales against order_date]
This is incomprehensible. Before you proceed further, try to list out three reasons why! We’ll tackle all three
of them below.
Let’s return to the dataset and see what’s happening with the dates.
view(store)
So immediately, we can see that there are multiple entries for the same date. So ggplot2 isn’t
plotting the daily sales: it is plotting every individual entry, so a date with 4 entries gets 4 separate points joined by the line.
We will need a single entry with the combined total sales for each day.
store %>%
group_by(order_date) %>%
summarise(sales = sum(sales))
## # A tibble: 1,237 x 2
## order_date sales
## <date> <dbl>
## 1 2014-01-03 16.4
## 2 2014-01-04 288.
## 3 2014-01-05 19.5
## 4 2014-01-06 4407.
## 5 2014-01-07 87.2
## 6 2014-01-09 40.5
## 7 2014-01-10 54.8
## 8 2014-01-11 9.94
## 9 2014-01-13 3554.
## 10 2014-01-14 62.0
## # ... with 1,227 more rows
This groups the data by order_date (which gives us one row per date) and then calculates the total sales
for each date!
store %>%
group_by(order_date) %>%
summarise(sales = sum(sales)) %>%
ggplot() + geom_line(aes(order_date, sales))
[Figure: line plot of total daily sales against order_date]
A slight change, but still not enough. We’ve fixed the first issue though, so let’s move on to the second:
splitting or filtering this by one of its many categorisations.
This data contains sales across segments, categories, sub-categories in various cities and states. It’s actually
a huge mess in that sense - lots of different information is currently being combined into our final sales
number for the day.
Let’s filter our data to include only Central sales, and then go back to our group_by() and add segment, so
that we can then break apart the lines by segment.
store %>%
filter(region == "Central") %>%
group_by(order_date, segment) %>%
summarise(sales = sum(sales))
## `summarise()` has grouped output by 'order_date'. You can override using the
## `.groups` argument.
## # A tibble: 950 x 3
## # Groups: order_date [720]
## order_date segment sales
## <date> <chr> <dbl>
## 1 2014-01-03 Consumer 16.4
## 2 2014-01-04 Home Office 288.
## 3 2014-01-07 Consumer 87.2
## 4 2014-01-09 Consumer 40.5
## 5 2014-01-20 Consumer 709.
## 6 2014-01-23 Consumer 5.94
## 7 2014-01-26 Corporate 153.
## 8 2014-01-30 Consumer 240.
## 9 2014-02-01 Consumer 469.
## 10 2014-02-06 Consumer 8.95
## # ... with 940 more rows
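The message about .groups is informational, not an error. If you would rather get an ungrouped result and no message, one option is to set .groups = "drop" in summarise(). A minimal sketch on a hypothetical three-row stand-in for store:

```r
library(dplyr)

# Hypothetical stand-in for store
store_demo <- data.frame(
  order_date = as.Date(c("2014-01-03", "2014-01-03", "2014-01-04")),
  segment    = c("Consumer", "Consumer", "Home Office"),
  sales      = c(10, 6.4, 288)
)

# .groups = "drop" returns a plain, ungrouped tibble and silences the message
daily <- store_demo %>%
  group_by(order_date, segment) %>%
  summarise(sales = sum(sales), .groups = "drop")
daily
```

The plots in this document come out the same either way, so this is a matter of tidiness rather than correctness.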
store %>%
filter(region == "Central") %>%
group_by(order_date, segment) %>%
summarise(sales = sum(sales)) %>%
ggplot() + geom_line(aes(order_date, sales, col = segment))
## `summarise()` has grouped output by 'order_date'. You can override using the
## `.groups` argument.
[Figure: line plot of daily sales against order_date for the Central region, one coloured line per segment (Consumer, Corporate, Home Office)]
So we’ve got what we wanted, but it’s still very difficult to read. If we wanted to zero in on
specific products and regions, we could (and probably should) filter it further.
At any rate, that’s the second problem dealt with.
The third problem is less of a problem and more of something that’d make our lives as analysts easier:
with 365 days in a year, i.e. 365 x-axis values per year, a daily plot is impractical to extract information from visually. We could
try to condense these into monthly aggregates and see what comes out then.
For that, we would need a month column, and for that, we would need to do something to our current
order_date column.
Lubridate has a function called month(), which extracts the month (from 1 to 12, or Jan to Dec).
store %>%
mutate(month = month(order_date)) %>%
view()
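One thing to keep in mind: month() returns only the month number, so the same calendar month from different years ends up in one group. The data runs from 2014 to 2017, so each monthly total below pools up to four years. If you ever want year and month kept apart, lubridate's floor_date() is one option. A small sketch with two made-up dates:

```r
library(lubridate)

# Two hypothetical order dates in the same calendar month of different years
d <- as.Date(c("2016-11-08", "2015-11-20"))

month(d)                 # month number, identical for both dates
month(d, label = TRUE)   # abbreviated month name as an ordered factor
floor_date(d, "month")   # first day of each date's month, keeping the year
```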
store %>%
mutate(month = month(order_date)) %>%
filter(region == "Central") %>%
group_by(month, segment) %>%
summarise(sales = sum(sales))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
## # A tibble: 36 x 3
## # Groups: month [12]
## month segment sales
## <dbl> <chr> <dbl>
## 1 1 Consumer 16479.
## 2 1 Corporate 13060.
## 3 1 Home Office 2145.
## 4 2 Consumer 4078.
## 5 2 Corporate 1712.
## 6 2 Home Office 2422.
## 7 3 Consumer 24791.
## 8 3 Corporate 9109.
## 9 3 Home Office 7317.
## 10 4 Consumer 13723.
## # ... with 26 more rows
store %>%
mutate(month = month(order_date)) %>%
filter(region == "Central") %>%
group_by(month, segment) %>%
summarise(sales = sum(sales)) %>%
ggplot() + geom_line(aes(month, sales, col = segment))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
[Figure: line plot of monthly sales against month (x-axis ticks at 2.5, 5.0, 7.5, 10.0, 12.5) for the Central region, one coloured line per segment (Consumer, Corporate, Home Office)]
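The fractional x-axis ticks (2.5, 5.0 and so on) appear because month is a plain number. An optional tweak, not part of the original plot, is to ask for one tick per month with scale_x_continuous(). The data frame monthly_demo below is hypothetical and only stands in for the summarised data:

```r
library(ggplot2)

# Hypothetical monthly totals standing in for the summarised data
monthly_demo <- data.frame(
  month   = rep(1:12, times = 2),
  segment = rep(c("Consumer", "Corporate"), each = 12),
  sales   = c(seq(1000, 12000, by = 1000), seq(500, 6000, by = 500))
)

p <- ggplot(monthly_demo) +
  geom_line(aes(month, sales, col = segment)) +
  # One tick per month, labelled Jan to Dec using base R's month.abb
  scale_x_continuous(breaks = 1:12, labels = month.abb)
```

Calling print(p) draws the plot. On the real data, the same scale_x_continuous() line could be added to the end of the plotting code above.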
Submission
For submission, upload a Word doc or PDF with the final plot (above) and add your name to the title. Also, try
to answer the following three questions!
2. So consumer sales shoot up in September and December, while corporate sales rise in May and October.
Why?
3. Is this pattern true for all regions?