
CA-ONE B9DA102 Data Storage Solutions for Data Analytics

Submitted by - Akhilesh Sharma

MSc in Data Analytics, Batch - April 2019

Dublin Business School

July 21, 2019

Index
1. Proof of concept data warehouse
2. Dimensional modelling AdventureWorks
3. Reasons for selecting the subject area
4. Identify key stakeholders
5. Vision and goals and requirements for the data warehouse
6. The reasons behind selecting star schema
7. Identify the Dimensions
8. Defining the Fact
9. Star Schema Screen
10. ETL for dimension tables by SQL Server
11. ETL by SSIS
12. ETL for fact table by SQL Server
13. Visualization in R
14. Visualization and Reports in Tableau
15. Reports in SSRS
16. XML validation Screenshot
17. XML code and schema
18. Schema for Neo4j
19. Cypher to create Indexes and Constraints
20. Cypher to create Nodes and relationships
21. Results drawn from Neo4j
22. References
23. Appendix A
24. Appendix B

QUESTION 1
Develop a proof of concept data warehouse / data mart using dimensional modelling by
capturing data from an existing data source(s). Preferably do not use Northwind database
as it will be used for demonstration purposes during lectures. Document your reasons for
selecting the subject area(s), identify key stakeholders, formalise the vision and goals and
requirements for developing the data warehouse.

Solution : A data warehouse is a collection of data integrated from one or more of an
enterprise's OLTP databases to support BI operations, giving a global view of the business
for decision making. The data is organised for analysis, so that it can be retrieved in a
consistent way whenever it is needed.

I have selected AdventureWorks as my subject area to study. It is a sample database from
Microsoft about a company that manufactures cycles in the USA and sells them
internationally.

i. Reasons for selecting the subject area : AdventureWorks was published with SQL Server
2005 and has been regularly updated with subsequent versions, so it is widely used and well
documented. We can learn from what others have done when building a data warehouse on it,
which makes it easier to create our own data warehouse and do further research on it. It
also provides insight into how sales, budgeting, inventory and employee management work in
a company, which is a good learning experience as well as a good opportunity to produce
useful reports.

ii. Key stakeholders : My key stakeholder in Adventure Works is :

• Sales Manager

iii. Vision : Our vision is to improve sales for Adventure Works Cycles by generating reports
on previous sales for the sales manager.

iv. Goals and Requirements for developing the data warehouse : The data warehouse for
Adventure Works Cycles must support reporting on the following :

• Sales amounts per territory against the year of sale.
• Sales representatives' performance.
• The top 5 sales representatives, and the best and worst performers among them.
• Internet sales vs re-seller sales, yearly and in total.

QUESTION 2
Develop and present a suitable schema for the data warehouse. Discuss your reasons for the
design.

Solution : I have selected a star schema to design the data warehouse for Adventure Works. The
reasons for selecting a star schema are as below :

• A star schema is easy to develop and maintain. Queries typically execute faster than on
a snowflake or galaxy schema because fewer joins are needed.
• A star schema simplifies business logic, which also reduces data loading time into the
warehouse.
• It can be used directly by relational online analytical processing (ROLAP) operations.
• Analytical queries typically run faster on a star schema than on a normalised OLTP
schema because it has fewer tables and simpler joins.
• It maintains referential integrity when data is loaded, as each row in a dimension table
has a primary key, which is a foreign key in the fact table.

To create my data warehouse I went through the following steps :

i. Selecting a Business Process or Objective : The business objective for the Adventure Works
data warehouse is to identify the sales made in each territory, by internet or re-seller, and
the performance of sales representatives.

ii. Defining a Grain : To support the defined business process, our grain is daily sales.

iii. Identify the Dimensions : The dimensions we need for our grain are Territory,
Sales Representative and Date.

iv. Defining the Fact : The fact we need to consider for our grain "Daily Sales" is the sales
transaction details, as per the dimensions. My fact table can be seen as below :

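Below is the star schema generated with the help of Microsoft SQL Server Management
Studio. Besides the dimension keys, my fact table holds two columns: total sales, which is
the numeric measure, and sales type (internet or re-seller), which is a non-additive
descriptive attribute rather than a measure.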

All supporting documents/project files are submitted in the attachment with this PDF
file.
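
As a rough illustration of the star schema described in this answer, the sketch below creates three dimension tables and one fact table in an in-memory SQLite database from Python. It is not the SQL Server script submitted with the project: every table and column name (dim_date, dim_territory, dim_sales_rep, fact_sales and their columns) and every data row is an assumption made for the example. It also demonstrates the referential-integrity point listed among the reasons above, since a fact row that points at a missing dimension key is rejected.

```python
import sqlite3

# Hypothetical names throughout; the submitted SQL Server schema may differ.
DDL = """
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,   -- e.g. 20130115
    full_date TEXT    NOT NULL,
    year      INTEGER NOT NULL,
    quarter   INTEGER NOT NULL
);
CREATE TABLE dim_territory (
    territory_key INTEGER PRIMARY KEY,
    name          TEXT NOT NULL
);
CREATE TABLE dim_sales_rep (
    sales_rep_key INTEGER PRIMARY KEY,
    full_name     TEXT NOT NULL
);
CREATE TABLE fact_sales (
    date_key      INTEGER NOT NULL REFERENCES dim_date(date_key),
    territory_key INTEGER NOT NULL REFERENCES dim_territory(territory_key),
    -- Assumed nullable: an internet sale may have no sales representative.
    sales_rep_key INTEGER REFERENCES dim_sales_rep(sales_rep_key),
    sales_type    TEXT NOT NULL CHECK (sales_type IN ('Internet', 'Reseller')),
    total_sales   REAL NOT NULL
);
"""


def build_star_schema():
    """Create the example star schema in a fresh in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript(DDL)
    return conn


conn = build_star_schema()
conn.execute("INSERT INTO dim_date VALUES (20130115, '2013-01-15', 2013, 1)")
conn.execute("INSERT INTO dim_territory VALUES (1, 'Example Territory')")
conn.execute("INSERT INTO dim_sales_rep VALUES (1, 'Example Rep')")
conn.execute("INSERT INTO fact_sales VALUES (20130115, 1, 1, 'Reseller', 1500.0)")

# A fact row whose territory_key has no matching dimension row is rejected.
try:
    conn.execute("INSERT INTO fact_sales VALUES (20130115, 99, 1, 'Reseller', 10.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Each fact row stores only keys plus the measure, so a report query joins the fact table to each dimension exactly once.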

QUESTION 3
Using Microsoft SQL Server, implement your tables and extract, transform and load data
from the operational source(s) into the data warehouse. This can be done using any available
tool such as SSIS or by writing SQL statements

Solution : Below are the screenshots for the ETL done with Microsoft SQL Server and SSIS. The
sales representative dimension is loaded with SSIS, while the date and territory dimensions
and the fact table are loaded with SQL in SQL Server.

i. Loading dimension date by SQL

ii. Loading dimension territory by SQL

iii. Loading dimension sales representative by SSIS

Here I have sorted two OLTP tables, "SalesRepresentative" and "Person", from
AdventureWorksCycles to create a merge join, because a merge join requires both inputs to
be sorted on the join key. The joined result is mapped column by column to the data
warehouse dimension table (dim-sales-rep). The data is then extracted, transformed
and loaded into the dimension table by executing the data flow.
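
To make the sort-then-merge idea concrete, here is a minimal Python sketch of a merge join. It is only an analogy for what the SSIS data flow does, not the package itself. The column names (id, territory_id, first_name, last_name), the output columns and all sample rows are hypothetical, and the function assumes the join key is unique within each input.

```python
def merge_join(left, right, key):
    """Inner merge join of two row lists that are already sorted by `key`.

    Assumes `key` is unique within each input.
    """
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key == right_key:
            joined.append({**left[i], **right[j]})
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1  # left row has no partner; advance the left input
        else:
            j += 1  # right row has no partner; advance the right input
    return joined


# Hypothetical source rows, deliberately unsorted.
sales_reps = [{"id": 279, "territory_id": 5}, {"id": 275, "territory_id": 2}]
persons = [
    {"id": 276, "first_name": "Not", "last_name": "ARep"},
    {"id": 279, "first_name": "Ben", "last_name": "Example"},
    {"id": 275, "first_name": "Ann", "last_name": "Example"},
]

# Step 1: sort both inputs on the join key.
sales_reps_sorted = sorted(sales_reps, key=lambda row: row["id"])
persons_sorted = sorted(persons, key=lambda row: row["id"])

# Step 2: merge join, then map the result to the dimension's columns.
dim_sales_rep_rows = [
    {"rep_id": row["id"], "full_name": f"{row['first_name']} {row['last_name']}"}
    for row in merge_join(sales_reps_sorted, persons_sorted, "id")
]
```

Because both inputs are sorted, the join needs only a single pass over each of them, which is why the sort step comes first.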

iv. Loading fact sales by SQL

All supporting documents/project files are submitted in the attachment with this PDF.

QUESTION 4
Produce four reports in support of the requirements outlined in section 1 using suitable
tool(s) (SSRS and / or Tableau). Also produce four visualizations using suitable tool(s) (R
and / or Python and / or Tableau / PowerBI) and discuss.

Solution : Below are all screens for reports and visualizations done using R, Tableau and SSRS.

i. Visualization in R : I have done the two visualizations below in R, for "Sales per sales
representative" and "Top five sales representatives".

a. Visualization on sales by sales representatives : Total sales for each sales representative
are calculated by a SQL query run from R and plotted in descending order in a bar plot. The
y-axis shows the sales representatives' names and the x-axis shows the sales in millions.

b. Visualization on Top five sales representatives : A pie-donut chart is used to display
the figures for the top five performing sales representatives. The pie slices show the
percentage of total sales achieved by each sales representative, and the donut ring shows
the amount of sales made, with the total taken from the data warehouse counted as 100%.
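
The calculations behind these two charts (total per representative, descending order, top five, percentage share) can be sketched in a few lines. The original work uses a SQL query run from R; the version below is a plain Python stand-in for illustration only. The function names, the representative names and the amounts are all made up, and the shares are computed against the total over all representatives, which is an assumption about how the 100% base is defined.

```python
from collections import defaultdict


def total_sales_by_rep(fact_rows):
    """Sum sales per representative and rank them in descending order."""
    totals = defaultdict(float)
    for rep_name, amount in fact_rows:
        totals[rep_name] += amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def percentage_shares(ranked, grand_total):
    """Share of grand_total for each (rep, amount) pair, in percent."""
    return {rep: 100.0 * amount / grand_total for rep, amount in ranked}


# Hypothetical fact rows: (sales representative, sales amount).
fact_rows = [
    ("Rep A", 40.0), ("Rep B", 25.0), ("Rep A", 10.0), ("Rep C", 15.0),
    ("Rep D", 5.0), ("Rep E", 3.0), ("Rep F", 2.0),
]

ranked = total_sales_by_rep(fact_rows)              # data for the bar plot
top_five = ranked[:5]                               # data for the pie-donut chart
grand_total = sum(amount for _, amount in ranked)   # the 100% base
shares = percentage_shares(top_five, grand_total)
```

The best performer is the first entry of the ranked list and the worst performer is the last, so the same ranking serves both requirements.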

ii. Visualization and Reports in Tableau : In Tableau I have produced two reports, on yearly
internet versus re-seller sales and on quarterly sales for Adventure Works, as well as two
visualizations, on internet versus re-seller sales and on sales per territory. Screenshots
can be found below and the .twb file can be found in the attachment.

a. Report on yearly online sales versus re-seller sales

b. Visualization on yearly sales by territory : A grouped bar plot shows the yearly sales
for each territory. The lower x-axis shows the year in which the sales were made, and the
top of the x-axis shows the territory by which the bars are grouped. The y-axis shows the
sales in millions.

c. Visualization on internet sales against re-seller sales : A simple pie chart displays the
amount of sales by channel (online/offline).

d. Report on quarterly sales by territory : A matrix report with rows grouped by territory
and columns grouped by year and quarter of sale.
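
The matrix layout of this report is a pivot of the fact data: one row per territory, one column per (year, quarter), and the summed sales in each cell. The Python sketch below illustrates that pivot only; it is not how Tableau builds the report, and the function name, territory names and amounts are hypothetical.

```python
from collections import defaultdict


def quarterly_matrix(rows):
    """Pivot (territory, year, quarter, amount) rows into a matrix.

    Returns the sorted (year, quarter) column headers and a dict mapping
    each territory to its list of cell totals, in column order.
    """
    cells = defaultdict(float)
    for territory, year, quarter, amount in rows:
        cells[(territory, (year, quarter))] += amount

    territories = sorted({territory for territory, _ in cells})
    periods = sorted({period for _, period in cells})
    matrix = {
        territory: [cells.get((territory, period), 0.0) for period in periods]
        for territory in territories
    }
    return periods, matrix


# Hypothetical fact rows: (territory, year, quarter, sales amount).
rows = [
    ("Territory B", 2012, 1, 100.0),
    ("Territory B", 2012, 1, 50.0),
    ("Territory B", 2012, 2, 70.0),
    ("Territory A", 2012, 2, 30.0),
]

periods, matrix = quarterly_matrix(rows)
```

A territory with no sales in a quarter gets 0.0 in that cell, so every row has the same number of columns.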

iii. Reports in SSRS : The SSRS reports below cover sales per representative, the top 5
sales representatives, and territory-wise sales by sales representative.

a. Reports on sales representatives : These SSRS reports follow the visualizations done in R
for the top sales representatives and for total sales per sales representative.

b. Report on Territory wise sales by sales representatives : This report shows the sales
made by sales representatives across territories. The table below lists sales by
representative and territory, so that they can be compared row by row and column by column.

All supporting documents/project files are submitted in the attachment with this PDF file.

QUESTION 5
A good data warehousing professional is always scouting for new sources of data that can be
applied to business intelligence. Plus, analysis can be richer and more open ended if you work
with original XML documents, the same way you'd work with any detailed source data. In
this spirit and to enhance an understanding of XML, develop an XML Schema based on the
data warehouse data that a cube may capture and apply it to develop an XML document.

Solution : Below are the screens for XML validation and respective schemas.

i. XML validation :

ii. XML code :

iii. XSD schema :

iv. DTD schema :

All supporting documents/project files are submitted in the attachment with this PDF file.

QUESTION 6
Implement a part of your source database or data warehouse as graph database using Neo4j
technologies. Discuss the use of graph databases in comparison to relational databases.

Solution : The screenshots of the Neo4j graph and results are as below :

i. Schema - Below is the schema generated in the Neo4j graph database. It consists of product,
customer, employee and product category nodes and their respective relationships:

ii. Creating indexes for faster lookups and a constraint for the unique key :

iii. Creating nodes and importing CSV data into them :

iv. Creating relationships between nodes :

v. Results drawn :

a. Total products sold by Linda to Catherine

b. Who sold order number 43659 to which customer

c. List of the top 10 customers by total number of products purchased

d. Employees and whom they report to

e. Products supplied by International Trek Center
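
Each of the results above is a traversal over relationships between nodes. As a rough illustration of that idea outside Neo4j, the sketch below stores a tiny graph as relationship triples in Python and answers two of the same kinds of question. It is not the Cypher used in the project: the relationship type names (REPORTS_TO, PURCHASED), the quantity property, the node identifiers and all data are hypothetical.

```python
from collections import Counter

# Relationship records: (start_node, relationship_type, end_node, properties).
relationships = [
    ("emp:2", "REPORTS_TO", "emp:1", {}),
    ("emp:3", "REPORTS_TO", "emp:1", {}),
    ("cust:10", "PURCHASED", "prod:100", {"quantity": 3}),
    ("cust:10", "PURCHASED", "prod:101", {"quantity": 1}),
    ("cust:11", "PURCHASED", "prod:100", {"quantity": 2}),
]


def reports_to(rels):
    """Map each employee to the employee they report to."""
    return {
        start: end
        for start, rel_type, end, _ in rels
        if rel_type == "REPORTS_TO"
    }


def top_customers(rels, n=10):
    """Rank customers by the total quantity of products purchased."""
    totals = Counter()
    for start, rel_type, _, props in rels:
        if rel_type == "PURCHASED":
            totals[start] += props["quantity"]
    return totals.most_common(n)


managers = reports_to(relationships)
best_customers = top_customers(relationships, n=10)
```

In a relational database the same questions need joins through key columns, whereas in a graph database the relationship is stored directly on the data and followed as-is.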

References
XML validation was done online at Utilities-online <http://www.utilities-online.info/>.
Significant help in understanding problems was taken from Stack Overflow
<http://www.stackoverflow.com/>.

Appendix A

1. How to locate the supporting documents?


a. All the supporting documents for the questions above can be found in the zip file attached
with this PDF.
b. Path to navigate to resources - open the folder named after the question solved; the
paths can be navigated as below -

• folder-name-as-per-question/screens - for viewing screenshots only.

• folder-name-as-per-question/ - for accessing project files.

Example - ETL/screen - for ETL screenshots, and ETL/ - for SQL and SSIS files.

Appendix B

1. Data warehouse development scripts - screenshots

2. Visualization in R code - The visualization screens are shown with Question 4, and the code
that produces them is given below. The code can also be found in the attachment with this PDF.

3. ETL Screens - All ETL screens are included under their respective questions.

