Exploratory Visualisations
Let's begin with some visualisations of the sales data. For example, we can look at the
total number of sales across the regions as well as the total number of sales by
ProductSubcategory. We will use R's ggplot2 library for this.
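As a rough sketch of how these two summaries might be produced, the snippet below counts sales by region and by ProductSubcategory and draws a bar chart of each. The small salesData frame is invented for illustration (the real one comes from the view), and treating each row as one sale is an assumption.

```r
# Toy stand-in for salesData; the values are invented for illustration only.
salesData <- data.frame(
  SalesTerritoryRegion = c("Australia", "Australia", "Canada",
                           "France", "Canada", "Australia"),
  EnglishProductSubcategoryName = c("Road Bikes", "Tires and Tubes",
                                    "Tires and Tubes", "Caps",
                                    "Bottles and Cages", "Mountain Bikes")
)

# Assumption: each row is one sale, so counting rows gives the number of sales.
salesByRegion <- as.data.frame(table(Region = salesData$SalesTerritoryRegion))
salesBySubcategory <- as.data.frame(
  table(Subcategory = salesData$EnglishProductSubcategoryName))

# Only draw the charts if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  regionPlot <- ggplot(salesByRegion, aes(x = Region, y = Freq)) +
    geom_col() +
    labs(y = "Number of sales")
  subcategoryPlot <- ggplot(salesBySubcategory, aes(x = Subcategory, y = Freq)) +
    geom_col() +
    coord_flip() +   # horizontal bars keep long subcategory names readable
    labs(y = "Number of sales")
  print(regionPlot)
  print(subcategoryPlot)
}
```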
The most interesting thing about this plot is that there are very few sales for the Central,
Northeast and Southeast regions. The relative imbalance between these three regions
and the rest means that we should be cautious about making any conclusions about
these regions.
The plot above shows that some ProductSubcategories are far more popular than
others, for example "Tires and Tubes" and "Bottles and Cages". We can also see that
Road Bikes and Mountain Bikes form a good portion of overall sales.
Since we are interested in the differences between regions, let's view the relative
proportions of products sold by region:
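One way to get these relative proportions is to cross-tabulate region against subcategory and divide each row by its total. This is a minimal sketch on invented data, not necessarily how the original plot was built.

```r
# Toy stand-in for salesData; the values are invented for illustration only.
salesData <- data.frame(
  SalesTerritoryRegion = c("Australia", "Australia", "Australia", "Canada",
                           "Canada", "Canada", "Canada", "France"),
  EnglishProductSubcategoryName = c("Road Bikes", "Road Bikes", "Caps",
                                    "Tires and Tubes", "Tires and Tubes",
                                    "Bottles and Cages", "Caps", "Caps")
)

# Rows = regions, columns = subcategories, cells = number of sales.
counts <- table(salesData$SalesTerritoryRegion,
                salesData$EnglishProductSubcategoryName)

# margin = 1 divides each row by its row total, so every region sums to 1.
proportions <- prop.table(counts, margin = 1)
print(round(proportions, 2))
```

Working with proportions rather than raw counts stops the regions with the most sales from dominating the comparison.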
This plot hints at some interesting differences between the regions, but it also
highlights the limitations of basic summaries to shed light on underlying patterns and
trends. For example we can see that there are two very popular products in Canada.
Also obvious is the different profiles for the Northeast and Southeast - though as
mentioned we won't focus on these regions. Beyond this however, this plot fails to
obviously distinguish the other regions, for example it is very difficult to distinguish
Australia, France, the UK and Germany.
Given that we cannot easily see differences between these regions, the question is
whether there are any differences in purchasing behaviours at all. Below, we will explore
this using Principal Components Analysis (PCA). When plotted, the principal components
should give us a reasonable indication of which regions are similar and which are
different. First, we will summarise the number of sales by product and by region:
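A summary like this can be built with base R's xtabs(). The sketch below uses a handful of invented rows, so the ProductKey values and counts are purely illustrative.

```r
# Toy stand-in for salesData; the ProductKey values are invented.
salesData <- data.frame(
  SalesTerritoryRegion = c("Australia", "Australia", "Canada", "Canada",
                           "France", "France", "France"),
  ProductKey = c(310, 310, 477, 528, 477, 477, 222)
)

# One row per region, one column per ProductKey, cells = number of sales.
regionByProduct <- xtabs(~ SalesTerritoryRegion + ProductKey, data = salesData)
print(regionByProduct)
print(dim(regionByProduct))
```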
This creates a table with 10 rows (one for each region) and 158 columns (one for each
ProductKey). To visualise this, we will map the data onto two new dimensions using
PCA. This results in a table that has 10 rows (the regions) and 2 columns which capture
the purchasing behaviours in each region.
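The projection onto two dimensions can be sketched with base R's prcomp(). The counts below are invented, and converting them to within-region proportions before the PCA is our assumption; the steps described above do not say whether the data was scaled or transformed first.

```r
# Invented counts for illustration: 4 regions x 5 products.
regionByProduct <- matrix(
  c(40, 35,  5,  8,  2,
    12, 10, 30, 28,  6,
    11,  9, 33, 25,  7,
     5,  4, 10, 12, 45),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Australia", "France", "Germany", "Canada"),
                  paste0("P", 1:5))
)

# Assumption: compare purchasing profiles (row proportions), not raw volumes.
profiles <- prop.table(regionByProduct, margin = 1)

pca <- prcomp(profiles, center = TRUE, scale. = FALSE)

# Region coordinates on the first two principal components.
scores <- pca$x[, 1:2]

# Share of the total variance captured by each component.
varExplained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(varExplained, 3))

# Label each region at its position in the new two-dimensional space.
plot(scores, type = "n", xlab = "PC1", ylab = "PC2")
text(scores, labels = rownames(scores))
```

Regions that sit close together in this plot have similar purchasing profiles, and sum(varExplained[1:2]) gives the share of variation the two plotted dimensions capture.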
Although we haven't shown this here, the first two principal components (which we will
use for visualisation) capture 89% of the variation in this data. This is a good result, and
suggests that the purchasing behaviours have been well captured by these new
dimensions (visualised below):
From the above plot, we observe that Australia is quite unlike the rest of the regions.
The European regions (UK, Germany and France) are all very similar. Canada, the
Northwest and Southwest are all distinct. This plot hints at some interesting differences
between the regions, which we dig into in more detail in the following section.
Purchasing Behaviours
We will use correspondence analysis (CA) to identify which products are most popular
by region. CA addresses a similar question to 'Market Basket Analysis' (for more info, see
this blog post by MapR). First of all, let's see if there is an association between
ProductSubcategories and the various regions. We begin by creating another
contingency table, a count of the number of sales by ProductSubcategory in each
region:
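To show what happens under the hood, here is a hand-rolled sketch of correspondence analysis on a small invented contingency table, using only base R. In practice the CA() function in FactoMineR (loaded earlier) does this work for us; the counts and the three-by-three layout here are assumptions for illustration.

```r
# Invented counts: regions (rows) by ProductSubcategory (columns).
counts <- matrix(
  c(60, 20, 10,
    15, 30, 25,
    10, 35, 40),
  nrow = 3, byrow = TRUE,
  dimnames = list(Region = c("Australia", "United Kingdom", "Canada"),
                  Subcategory = c("Road Bikes", "Tires and Tubes", "Jerseys"))
)

# Chi-squared test of independence: is there any association at all?
test <- chisq.test(counts)
print(test$p.value)

# Correspondence analysis via SVD of the standardised residuals.
P <- counts / sum(counts)      # correspondence matrix
rowMass <- rowSums(P)
colMass <- colSums(P)
S <- diag(1 / sqrt(rowMass)) %*% (P - rowMass %o% colMass) %*%
  diag(1 / sqrt(colMass))
decomp <- svd(S)

# Each squared singular value is the inertia of one dimension.
inertia <- decomp$d^2
explained <- inertia / sum(inertia)
print(round(explained, 3))

# Principal coordinates of the regions on each dimension.
rowCoords <- diag(1 / sqrt(rowMass)) %*% decomp$u %*% diag(decomp$d)
rownames(rowCoords) <- rownames(counts)
print(round(rowCoords[, 1:2], 3))
```

The explained vector is the counterpart of the percentages quoted for each dimension, and the total inertia equals the chi-squared statistic divided by the number of sales.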
From this plot we can see that Dimension 1 and Dimension 2 explain 75% and 17% of
the association in the dataset respectively. Dimension 1 seems to be a contrast
between purchasing preferences in Australia versus Canada (again we are ignoring the
Central and East Coast). More specifically we see:
- Sales of bikes (road, mountain and touring) contribute a greater proportion of the
  Australian sales than they do in other regions.
- There is a clear association between the UK and "Tires and Tubes", Helmets and
  Caps. Caps are also popular in Germany and France.
- Sales of clothing (Socks, Gloves, Vests and Jerseys) are particularly strong on
  the West Coast.
More generally, we might conclude that sales are predominantly centered on the actual
bikes in Australia, and as we move from left to right (from Australia to Canada in the
plot above), we observe an increase in the sales of clothing and accessories.
Let's drill down further and look at the sales patterns by product. Again, we create a
contingency table of the number of sales per product by region, and then perform
correspondence analysis:
This plot is a hot mess and very tough to interpret. But in general, we can again see the
contrast between Canada and Australia across the first dimension, and a contrast
between Europe and Australia in the second dimension. Beyond that, it is a bit tough.
The distance between the Region and each product indicates how strongly that product
is associated with that region. So we can see a number of products up in the top right
and a bunch of products in the center which show weak associations. We can improve
this plot by removing weak associations and only focusing on the strong ones:
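One simple way to pick out the strong associations is to keep only the region and product pairs with a large Pearson residual. This is a swapped-in technique for illustration, not necessarily the filter used for the plot; the counts and the cut-off of 2 are assumptions.

```r
# Invented counts: regions (rows) by product (columns).
counts <- matrix(
  c(60, 20, 10,
    15, 30, 25,
    10, 35, 40),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Australia", "United Kingdom", "Canada"),
                  c("Road-150", "Road Tire Tube", "Mountain Shorts"))
)

# Pearson residuals: (observed - expected) / sqrt(expected) for every cell.
res <- chisq.test(counts)$residuals

# Assumption: an absolute residual above 2 counts as a strong association.
threshold <- 2
strong <- which(abs(res) > threshold, arr.ind = TRUE)

strongAssociations <- data.frame(
  Region   = rownames(res)[strong[, 1]],
  Product  = colnames(res)[strong[, 2]],
  Residual = res[strong]
)
print(strongAssociations)
```

Positive residuals mark products that sell more than expected in a region, and negative residuals mark products that sell less.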
The integration of RevolutionR with SQL Server 2016 promises a wider range of
analytical methods and greater flexibility for exploratory, predictive and visual data
analysis. In this post we use RevolutionR to analyse purchasing behaviours in the
AdventureWorksDW2012 InternetSales schema. We will cover a brief visual summary of
the sales data, detect differences in purchasing patterns across regions of the world
and finally identify these differences using correspondence analysis.
The Data
Our main goal in this analysis is to identify purchasing patterns in the InternetSales
schema of AdventureWorksDW2012. First of all we create a view which draws
information from the factInternetSales fact table and relevant dimensions (dimProduct,
dimProductSubCategory, dimSalesTerritory):
use AdventureWorksDW2012;
go
create view [vw_RSalesAnalysis]
as
select
    sales.ProductKey,
    p.ProductSubcategoryKey,
    ps.EnglishProductSubcategoryName,
    p.EnglishProductName,
    sales.SalesTerritoryKey,
    t.SalesTerritoryRegion,
    sales.CustomerKey
from dbo.FactInternetSales as sales
inner join dbo.DimProduct as p
    on p.ProductKey = sales.ProductKey
inner join dbo.DimProductSubcategory as ps
    on ps.ProductSubcategoryKey = p.ProductSubcategoryKey
inner join dbo.DimSalesTerritory as t
    on t.SalesTerritoryKey = sales.SalesTerritoryKey;
go
library(RODBC)       # ODBC access to SQL Server
library(ggplot2)     # plotting
library(FactoMineR)  # PCA and correspondence analysis
library(reshape2)    # reshaping data
## Open connection
## Note: "RSQLAnalytics2016" is an existing ODBC Data Source
SQLconnection <- odbcConnect("RSQLAnalytics2016",
                             uid = "<username>", pwd = "<password>")
## Retrieve data
salesData <- sqlFetch(SQLconnection, "vw_RSalesAnalysis")
## Close connection
odbcClose(SQLconnection)
Above, we used the sqlFetch() function from R's RODBC package to retrieve the data in
the view. Alternatively, the sqlQuery() function allows you to retrieve the results of a
query - this is particularly useful for ad hoc analysis. We can view the first few rows
using the head() function:
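As a small sketch of the sqlQuery() alternative, the helper below runs an ad hoc aggregate against the view. The helper name fetchSalesByRegion and the query text are hypothetical, and the toy data frame only stands in for salesData so that head() has something to show without a database connection.

```r
# Hypothetical helper: run an ad hoc query through an open RODBC channel.
# Nothing touches the database until this is called with a real connection.
fetchSalesByRegion <- function(channel) {
  RODBC::sqlQuery(channel, paste(
    "select SalesTerritoryRegion, count(*) as NumberOfSales",
    "from vw_RSalesAnalysis",
    "group by SalesTerritoryRegion"
  ))
}

# Toy stand-in for salesData; the values are invented for illustration only.
toySales <- data.frame(
  ProductKey = c(310, 477, 528, 222, 477),
  SalesTerritoryRegion = c("Australia", "Canada", "Canada", "France", "France")
)

# head() returns the first n rows (six by default).
firstRows <- head(toySales, 3)
print(firstRows)
```

With a live connection, fetchSalesByRegion(SQLconnection) would return one row per region, provided it is called before odbcClose().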
- Sales of women's mountain shorts are particularly high in Canada and on the
  West Coast of the USA.
- In addition, sales of "HL Mountain Tires", "HL Road Tires" and "Mountain Tire
  Tubes" are particularly high in the same regions.
- Sales of "Road Tire Tube", "LL Road Tires", "Touring Tyres", "Touring Tyre
  Tubes", "AWC Logo Caps" and "Long-sleeve Logo Jersey (M)" are particularly strong
  in Europe. This confirms our intuition that road racing is likely to be popular
  around France.
- A number of mountain and road bikes sell well in Australia.
  The "Short Sleeve Classic Jersey" also sells well in Australia.
Conclusions
We have managed to build an interesting picture of the cycling communities and their
purchasing behaviours in Canada / USA compared to Europe and Australia. Our analysis
suggests that mountain biking is particularly popular with women in Canada and the US,
and that products relating to this sell well. This is clearly different from the UK, Germany
and France, where road racing appears to be the most popular. In Australia, road biking
and mountain biking are both popular, with the sale of bikes forming a major part of
total sales. Information like this is invaluable for managing stock and for marketing
departments. On this evidence, there appears to be little need to invest heavily in stock
and marketing of mountain bikes in Europe, or road bikes in the US.
So why use R, when we could have performed a Market Basket Analysis using Analysis
Services? For me, more than anything it is the flexibility, convenience and power of R.
Based on Microsoft's very clear investment in R (integration with Azure Machine
Learning and SQL Server 2016), it looks to become a core part of Microsoft's business
intelligence and analytics stack.