
could help the business make decisions about stock levels and targeted marketing
campaigns across the regions.

Exploratory Visualisations
Let's begin with some visualisations of the sales data. For example, we can look at the
total number of sales across the regions as well as the total number of sales by
ProductSubcategory. We will use R's ggplot2 library for this.

sales_by_region <- ggplot(salesData,
                          aes(x=SalesTerritoryRegion, y=..count..)) +
        geom_bar(fill="steelblue") +
        theme(axis.text = element_text(size=5, angle=0),
              panel.background = element_blank()) +
        xlab("") +
        coord_flip() +
        ggtitle("Number of Sales by Territory Region")
sales_by_region

The most interesting thing about this plot is that there are very few sales for the Central,
Northeast and Southeast regions. The relative imbalance between these three regions
and the rest means that we should be cautious about making any conclusions about
these regions.
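
The imbalance can also be checked numerically rather than read off the chart. The sketch below is illustrative only: toy_regions is made-up data standing in for salesData$SalesTerritoryRegion, and the 5% cut-off is an arbitrary assumption, not a figure from this analysis.

```r
# Made-up region labels standing in for salesData$SalesTerritoryRegion
toy_regions <- c(rep("Australia", 40), rep("Canada", 30),
                 rep("France", 27), rep("Central", 2), rep("Northeast", 1))

# Count sales per region and convert the counts to shares of all sales
region_counts <- table(toy_regions)
region_share <- prop.table(region_counts)

# Flag regions contributing less than 5% of sales (arbitrary cut-off)
sparse_regions <- names(region_share)[as.vector(region_share) < 0.05]

print(sort(region_counts, decreasing = TRUE))
print(sparse_regions)
```

Run against the real data, the same three calls (table, prop.table, and the comparison) would list the regions that are too sparse to support firm conclusions.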

sales_by_cat <- ggplot(salesData,
                       aes(x=EnglishProductSubcategoryName, y=..count..)) +
        geom_bar(fill="steelblue") +
        theme(axis.text = element_text(size=5, angle=0),
              panel.background = element_blank()) +
        xlab("") +
        coord_flip() +
        ggtitle("Number of Sales by ProductSubCategory")
sales_by_cat

The plot above shows that some ProductSubcategories are far more popular than
others, for example "Tires and Tubes" and "Bottles and Cages". We can also see that
Road Bikes and Mountain Bikes form a good portion of overall sales.

Since we are interested in the differences between regions, let's view the relative
proportions of products sold by region:

product_sales_by_region <- ggplot(salesData,
                                  aes(x=ProductKey, y=..density..)) +
        geom_density(fill="steelblue", alpha=0.8) +
        facet_wrap(~ SalesTerritoryRegion) +
        theme_bw()
product_sales_by_region

This plot hints at some interesting differences between the regions, but it also
highlights how little basic summaries reveal about underlying patterns and
trends. For example, we can see that there are two very popular products in Canada.
The Northeast and Southeast also show obviously different profiles, though as
mentioned we won't focus on these regions. Beyond this, however, the plot fails to
clearly separate the other regions. For example, it is very difficult to distinguish
Australia, France, the UK and Germany.

Given that we cannot easily see differences between these regions, the question is whether
there are any differences in purchasing behaviours at all. Below, we will explore this
using Principal Components Analysis (PCA). When plotted, the principal components
should give us a reasonable indication of which regions are similar and which
are different. First, we will summarise the number of sales by product and by region:

by_region <- with(salesData, table(SalesTerritoryRegion,
                                   ProductKey))

This creates a table with 10 rows (one for each region) and 158 columns (one for each
ProductKey). To visualise this, we will map the data onto two new dimensions using
PCA. This results in a table that has 10 rows (the regions) and 2 columns which capture
the purchasing behaviours in each region.

pca_regions <- prcomp(by_region, scale.=TRUE)
pca_data <- data.frame(pca_regions$x)
pca_data$Region <- rownames(pca_data)

Although we haven't shown this here, the first two principal components (which we will
use for visualisation) capture 89% of the variation in this data. This is a good result, and
suggests that the purchasing behaviours have been well captured by these new
dimensions (visualised below):

From the above plot, we observe that Australia is quite unlike the rest of the regions.
The European regions (UK, Germany and France) are all very similar. Canada, the
Northwest and Southwest are all distinct. This plot hints at some interesting differences
between the regions, which we dig into in more detail in the following section.
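
The 89% figure quoted above is not derived in the text. One way such a figure can be obtained is from the standard deviations stored in the prcomp object. The sketch below is a minimal illustration under stated assumptions: toy_counts is a hypothetical 4 x 5 table of counts (the real table is 10 x 158), so its percentages will not match the article's.

```r
# Hypothetical region-by-product counts; the real table is 10 x 158
toy_counts <- matrix(
  c(120, 30, 15,  5, 10,
     40, 60, 35, 20, 25,
     35, 55, 40, 25, 30,
     20, 25, 70, 90, 45),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Australia", "France", "Germany", "Canada"),
                  paste0("P", 1:5))
)

# Same kind of call as in the article: centre and scale each product column
toy_pca <- prcomp(toy_counts, scale. = TRUE)

# The variance of each component is the square of its standard deviation
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cumulative <- cumsum(var_explained)

# Share of the total variance captured by the first two components
first_two <- cumulative[2]
print(round(100 * var_explained, 1))
print(round(100 * first_two, 1))
```

With the article's own objects, summary(pca_regions) reports the same proportions in its "Proportion of Variance" and "Cumulative Proportion" rows.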

Purchasing Behaviours
We will use correspondence analysis (CA) to identify which products are most popular
by region. CA can serve a similar purpose to 'Market Basket Analysis' (for more info, see this blog
post by MapR). First of all, let's see if there is an association between
ProductSubcategories and the various regions. We begin by creating another
contingency table, a count of the number of sales by ProductSubcategory in each
region:

by_region <- with(salesData, table(SalesTerritoryRegion,
                                   EnglishProductSubcategoryName))

Then we will use the CA() function in R's FactoMineR package.

results <- CA(by_region, graph=FALSE)
plot.CA(results, col.row="red", col.col="grey", cex=0.6,
        title="Region : Subcategory")

From this plot we can see that Dimension 1 and Dimension 2 explain 75% and 17% of
the association in the dataset respectively. Dimension 1 seems to be a contrast
between purchasing preferences in Australia versus Canada (again we are ignoring the
Central and East Coast). More specifically we see:

• Sales of bikes (road, mountain and touring) contribute a greater proportion of the
  Australian sales than they do in other regions.
• There is a clear association between the UK and "Tires and Tubes", Helmets and
  Caps. Caps are also popular in Germany and France.
• Sales of clothing (Socks, Gloves, Vests and Jerseys) are particularly strong on
  the West Coast.

More generally, we might conclude that sales are predominantly centered on the actual
bikes in Australia, and as we move from left to right (from Australia to Canada in the
plot above), we observe an increase in the sales of clothing and accessories.
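
The percentage attached to each dimension, and the per-product contributions used for filtering later in this article, can be reproduced by hand. The sketch below is not how FactoMineR is implemented; it is the textbook formulation of CA as a singular value decomposition of standardised residuals, written in base R. The matrix N is hypothetical and its counts are invented purely for illustration.

```r
# Hypothetical region-by-subcategory counts (not taken from AdventureWorks)
N <- matrix(
  c(90, 20, 10, 15,
    30, 45, 25, 40,
    25, 40, 30, 45,
    15, 30, 60, 35),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Australia", "France", "Germany", "Canada"),
                  c("Road Bikes", "Helmets", "Jerseys", "Tires and Tubes"))
)

P <- N / sum(N)            # correspondence matrix (cell proportions)
row_mass <- rowSums(P)     # share of all sales in each region
col_mass <- colSums(P)     # share of all sales in each subcategory

# Standardised residuals: departure of each cell from independence
S <- diag(1 / sqrt(row_mass)) %*% (P - outer(row_mass, col_mass)) %*%
  diag(1 / sqrt(col_mass))

dec <- svd(S)
inertia <- dec$d^2                           # inertia per dimension
pct_inertia <- 100 * inertia / sum(inertia)  # the "% explained" on each axis

# Column contributions (%) to the first two dimensions
col_contrib <- 100 * dec$v[, 1:2]^2
dimnames(col_contrib) <- list(colnames(N), c("Dim.1", "Dim.2"))

print(round(pct_inertia, 1))
print(round(col_contrib, 1))
```

The total inertia equals the chi-squared statistic of the table divided by the number of sales, which is why CA is often described as a decomposition of the association between rows and columns. In FactoMineR the corresponding values should be available as results$eig and results$col$contrib.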

Let's drill down further and look at the sales patterns by product. Again, we create a
contingency table of the number of sales per product by region, and then perform
correspondence analysis:

by_region <- with(salesData, table(SalesTerritoryRegion,
                                   ProductKey))
results <- CA(by_region, graph=FALSE)
plot.CA(results, col.row="red", col.col="grey", cex=0.6,
        title="Region : Product")

This plot is a hot mess and very tough to interpret. But in general, we can again see the
contrast between Canada and Australia across the first dimension, and a contrast
between Europe and Australia in the second dimension. Beyond that, it is a bit tough.
The distance between the Region and each product indicates how strongly that product
is associated with that region. So we can see a number of products up in the top right
and a bunch of products in the center which show weak associations. We can improve
this plot by removing weak associations and only focusing on the strong ones:

## get strength of associations: each product's contribution (%)
## to the first two dimensions, and include the ProductKey in this
strength <- data.frame(results$col$contrib[, 1:2])
strength$ProductKey <- rownames(strength)
## find the strongest associations (values > 1.5 on either dimension)
## loosely corresponds to associations outside the 95% confidence interval
major_products <- strength[strength$Dim.1 > 1.5 | strength$Dim.2 > 1.5,
                           "ProductKey"]
## from the sales data, identify which rows belong to the 'strong' products
## and retrieve only these rows
idx_major <- which(salesData$ProductKey %in% major_products)
majorSalesData <- salesData[idx_major, ]
## repeat the Correspondence Analysis
## (create contingency table and perform analysis)
by_region_filter <- with(majorSalesData,
                         table(SalesTerritoryRegion, ProductKey))
results_filter <- CA(by_region_filter, graph=FALSE)
## now create a new data frame for visualisation
## get the coordinates for both the products and regions
products <- data.frame(results_filter$col$coord)
regions <- data.frame(results_filter$row$coord)
## get the product names for visualisation
product_names <- unique(salesData[, c("ProductKey",
                                      "EnglishProductName")])
## look up the name for one key (as.character guards against factor codes)
get_product_names <- function(key)
    as.character(product_names$EnglishProductName[
        which(product_names$ProductKey == key)])
## Finally, visualise
ggplot(products, aes(x=Dim.1, y=Dim.2)) +
    geom_text(data=products, aes(x=Dim.1, y=Dim.2),
              label=sapply(rownames(products), get_product_names),
              colour="darkblue",
              size=2,
              alpha=0.7) +
    geom_point(data=regions, aes(x=Dim.1, y=Dim.2),
               colour="red", shape=4) +
    geom_text(data=regions, aes(x=Dim.1, y=Dim.2),
              label=rownames(regions), colour="red", size=4,
              vjust=-1, alpha=0.4) +
    theme_bw() +
    ggtitle("Region : Product")

Analysing Sales Patterns: R + SQL Server


Nick Burns, 2017-08-18 (first published: 2015-12-15)

The integration of RevolutionR with SQL Server 2016 promises a wider range of
analytical methods and greater flexibility for exploratory, predictive and visual data
analysis. In this post we use RevolutionR to analyse purchasing behaviours in the
AdventureWorksDW2012 InternetSales schema. We will cover a brief visual summary of
the sales data, detect differences in purchasing patterns across regions of the world
and finally identify these differences using correspondence analysis.

The Data
Our main goal in this analysis is to identify purchasing patterns in the InternetSales
schema of AdventureWorksDW2012. First of all we create a view which draws
information from the factInternetSales fact table and relevant dimensions (dimProduct,
dimProductSubCategory, dimSalesTerritory):

use AdventureWorksDW2012;
go
create view [vw_RSalesAnalysis]
as
select
    sales.ProductKey,
    p.ProductSubcategoryKey,
    ps.EnglishProductSubcategoryName,
    p.EnglishProductName,
    sales.SalesTerritoryKey,
    t.SalesTerritoryRegion,
    sales.CustomerKey
from dbo.FactInternetSales as sales
    inner join dbo.DimProduct as p
        on p.ProductKey = sales.ProductKey
    inner join dbo.DimProductSubcategory as ps
        on ps.ProductSubcategoryKey = p.ProductSubcategoryKey
    inner join dbo.DimSalesTerritory as t
        on t.SalesTerritoryKey = sales.SalesTerritoryKey;
go

From RStudio we can access this view using an ODBC connection:

library(RODBC)
library(ggplot2)
library(FactoMineR)
library(reshape2)
## Open connection
## Note: "RSQLAnalytics2016" is an existing ODBC Data Source
SQLconnection <- odbcConnect("RSQLAnalytics2016",
                             uid="<username>", pwd="<password>")
## Retrieve data
salesData <- sqlFetch(SQLconnection, "vw_RSalesAnalysis")
## Close connection
odbcClose(SQLconnection)

Above, we used the sqlFetch() function from R's RODBC package to retrieve the data in
the view. Alternatively, the sqlQuery() function allows you to retrieve the results of a
query, which is particularly useful for ad hoc analysis. We can view the first few rows
using the head() function:

## Look at the first few rows
head(salesData)

ProductKey  ProductSubcategoryKey  EnglishProductSubcategoryName  EnglishProductName       SalesTerritoryKey  SalesTerritoryRegion  CustomerKey
310         2                      Road Bikes                     Road-150 Red, 62         6                  Canada                21768
346         1                      Mountain Bikes                 Mountain-100 Silver, 44  7                  France                28389
346         1                      Mountain Bikes                 Mountain-100 Silver, 44  1                  Northwest             25863
336         2                      Road Bikes                     Road-650 Black, 62       4                  Southwest             14501
346         1                      Mountain Bikes                 Mountain-100 Silver, 44  9                  Australia             11003
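
Returning briefly to the sqlQuery() alternative mentioned before the table: the sketch below shows one way it might be used to pull a single region from the view. Both build_region_query() and fetch_region_sales() are hypothetical helpers written for this illustration, not part of RODBC, and the column list is only an example.

```r
# Build the T-SQL text separately so it can be inspected before it is run.
# build_region_query() is a hypothetical helper, not part of RODBC.
build_region_query <- function(region) {
  # Double any single quotes so the value is a valid T-SQL string literal
  safe_region <- gsub("'", "''", region, fixed = TRUE)
  sprintf(
    paste("select ProductKey, EnglishProductName, CustomerKey",
          "from vw_RSalesAnalysis",
          "where SalesTerritoryRegion = '%s'"),
    safe_region
  )
}

# Run the query on an open RODBC channel (for example SQLconnection).
# fetch_region_sales() is also a hypothetical helper.
fetch_region_sales <- function(channel, region) {
  RODBC::sqlQuery(channel, build_region_query(region),
                  stringsAsFactors = FALSE)
}

query_text <- build_region_query("Australia")
print(query_text)
```

A call such as fetch_region_sales(SQLconnection, "Australia") would have to be made while the connection is still open, that is, before odbcClose().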

The sales data has one row for every sale made, and includes the product and
customer keys along with relevant information about which region the sale was
recorded in. We are interested in whether different regions exhibit different
purchasing patterns which

Finally, from this plot we can conclude:

• The sales of women's mountain shorts are particularly high in Canada and the
  West Coast of the USA.
• In addition, the sales of "HL Mountain Tires", "HL Road Tires" and "Mountain Tire
  Tubes" are particularly high in the same regions.
• The sale of "Road Tire Tube", "LL Road Tires", "Touring Tyres", "Touring Tyre
  Tubes", "AWC Logo Caps" and "Long-sleeve Logo Jersey (M)" is particularly strong
  in Europe. This confirms our intuition that road racing is likely to be popular
  around France.
• There are a number of mountain and road bikes which sell well in Australia.
  The "Short Sleeve Classic Jersey" also sells well in Australia.

Conclusions
We have managed to build an interesting picture of the cycling communities and their
purchasing behaviours in Canada / USA compared to Europe and Australia. Our analysis
suggests that mountain biking is particularly popular with women in Canada and the US,
and that products relating to this sell well. This is clearly different from the UK, Germany
and France where road racing appears to be the most popular. In Australia, road biking
and mountain biking are both popular, with the sale of bikes forming a major part of total
sales. Information like this is invaluable for managing stock and for marketing
departments. On this evidence, there appears to be little need to invest heavily in stock
and marketing of mountain bikes in Europe, or road bikes in the US.

So why use R, when we could have performed a Market Basket Analysis using Analysis
Services? For me, more than anything it is the flexibility, convenience and power of R.
Based on Microsoft's very clear investment in R (integration with Azure Machine
Learning and SQL Server 2016), it looks to become a core part of Microsoft's business
intelligence and analytics stack.
