
Shubham Verma (50)

Practical 1: Import the legacy data from different sources such as Excel, SQL Server, Oracle, etc., and load it into the target system
1. Create an excel sheet with data.
2. Open Power BI desktop.
3. Go to the Home ribbon -> Get Data -> Excel and browse to the Excel file.
4. In the Navigator tab, select the table (Sheet1) from the dataset (Legacy dataset.xlsx). (Sample data: https://docs.microsoft.com/en-us/power-bi/create-reports/sample-financial-download)
5. Click the Edit button.
6. We will obtain the query editor screen.


Practical 2: Perform the Extraction, Transformation and Loading (ETL) process to construct the database in SQL Server / Power BI
An ETL tool extracts data from different RDBMS source systems, transforms it (for example, by applying calculations or concatenating fields), and then loads it into the Data Warehouse system.
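A single ETL hop can be sketched in two T-SQL statements. The schema used here (Sales.Orders, Staging.Orders, DW.FactSales and their columns) is hypothetical, a minimal sketch rather than a prescribed design:

    -- Extract: land raw source rows in a staging table
    INSERT INTO Staging.Orders
    SELECT *
    FROM Sales.Orders;

    -- Transform + Load: apply a simple cleansing rule while loading the warehouse
    INSERT INTO DW.FactSales (OrderID, OrderDate, Amount)
    SELECT OrderID, OrderDate, Amount
    FROM Staging.Orders
    WHERE Amount IS NOT NULL;   -- drop rows that would break downstream reports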

It is tempting to think that creating a Data Warehouse is simply a matter of extracting data from multiple sources and loading it into the database of a Data Warehouse. This is far from the truth: it requires a complex ETL process. The ETL process requires active input from various stakeholders, including developers, analysts, testers and top executives, and is technically challenging.
To maintain its value as a tool for decision-makers, a Data Warehouse must change as the business changes. ETL is a recurring activity (daily, weekly or monthly) of a Data Warehouse system and needs to be agile, automated, and well documented.
Why do you need ETL?
There are many reasons for adopting ETL in the organization:
1. It helps companies analyze their business data in order to make critical business decisions.
2. Transactional databases cannot answer the complex business questions that can be answered with ETL.
3. A Data Warehouse provides a common data repository.
4. ETL provides a method of moving data from various sources into a data warehouse.
5. As data sources change, the Data Warehouse updates automatically.
6. A well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.
7. It allows verification of data transformation, aggregation and calculation rules.
8. The ETL process allows sample data comparison between the source and the target system.
9. The ETL process can perform complex transformations, though it requires an extra (staging) area to store the data.
10. ETL helps to migrate data into a Data Warehouse, converting it to various formats and types so that it adheres to one consistent system.


11. ETL is a predefined process for accessing and manipulating source data into the target database.
12. ETL in a data warehouse offers deep historical context for the business.
13. It helps to improve productivity, because it codifies and reuses processes without requiring technical skills.

ETL Process in Data Warehouses


ETL is a 3-step process.
Step 1) Extraction
In this step of the ETL architecture, data is extracted from the source system into the staging area. Any transformations are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data were copied directly from the source into the Data Warehouse database, rollback would be a challenge. The staging area gives an opportunity to validate extracted data before it moves into the Data Warehouse. A Data Warehouse needs to integrate systems that have different DBMSs, hardware, operating systems and communication protocols.

Sources could include legacy applications like mainframes, customized applications, point-of-contact devices such as ATMs and call switches, text files, spreadsheets, ERP systems, and data from vendors and partners, among others.
Three Data Extraction methods:
1. Full Extraction
2. Partial Extraction - without update notification
3. Partial Extraction - with update notification
Irrespective of the method used, extraction should not affect the performance and response time of the source systems. These source systems are live production databases; any slowdown or locking could affect the company's bottom line.
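As a rough illustration, the first two methods map to simple T-SQL patterns. The tables (Sales.Orders, Staging.Orders, Etl.RunLog) and the LastModified column are hypothetical:

    -- 1. Full Extraction: copy every source row into staging
    INSERT INTO Staging.Orders
    SELECT * FROM Sales.Orders;

    -- 2. Partial Extraction (no update notification): pull only rows changed
    --    since the last successful run, recorded in a hypothetical ETL log table
    DECLARE @LastRun datetime = (SELECT MAX(RunDate) FROM Etl.RunLog);

    INSERT INTO Staging.Orders
    SELECT * FROM Sales.Orders
    WHERE LastModified > @LastRun;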

Some validations are done during Extraction:


1. Reconcile records with the source data


2. Make sure that no spam/unwanted data is loaded.


3. Data type check
4. Remove all types of duplicate/fragmented data
5. Check whether all the keys are in place or not.
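Two of these checks can be expressed directly as queries against the staging area. A minimal sketch, assuming the hypothetical Staging.Orders table with an OrderID key:

    -- Reconcile record counts between the source and the staged copy
    SELECT (SELECT COUNT(*) FROM Sales.Orders)   AS SourceRows,
           (SELECT COUNT(*) FROM Staging.Orders) AS StagedRows;

    -- Check for duplicate keys before the data moves on
    SELECT OrderID, COUNT(*) AS Copies
    FROM Staging.Orders
    GROUP BY OrderID
    HAVING COUNT(*) > 1;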
Step 2) Transformation

Data extracted from the source server is raw and not usable in its original form. Therefore it needs to be cleansed, mapped and transformed. In fact, this is the key step where the ETL process adds value and changes data such that insightful BI reports can be generated.
Transformation is one of the important ETL concepts: you apply a set of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data. In the transformation step, you can perform customized operations on data. For instance, the user may want a sum-of-sales revenue figure that is not in the database, or the first name and the last name in a table may be stored in different columns; it is possible to concatenate them before loading.
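Both of those examples translate directly into T-SQL. A sketch, again using hypothetical staging tables and columns (FirstName, LastName, UnitPrice, Quantity):

    -- Derive the sum-of-sales revenue figure that is not stored in the source
    SELECT CustomerID, SUM(UnitPrice * Quantity) AS SalesRevenue
    FROM Staging.Orders
    GROUP BY CustomerID;

    -- Concatenate first and last name before loading
    -- (CONCAT requires SQL Server 2012 or later)
    SELECT CONCAT(FirstName, ' ', LastName) AS CustomerName
    FROM Staging.Customers;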
Following are Data Integrity Problems:
1. Different spelling of the same person like Jon, John, etc.
2. There are multiple ways to denote company name like Google, Google Inc.
3. Use of different names like Cleaveland, Cleveland.
4. There may be a case that different account numbers are generated by various
applications for the same customer.
5. In some records, required fields remain blank.
6. Invalid product data collected at the POS, as manual entry can lead to mistakes.
Validations done during this stage:
1. Filtering – Select only certain columns to load
2. Using rules and lookup tables for Data standardization
3. Character Set Conversion and encoding handling
4. Conversion of Units of Measurements like Date Time Conversion, currency
conversions, numerical conversions, etc.
5. Data threshold validation check. For example, age cannot be more than two digits.


6. Data flow validation from the staging area to the intermediate tables.
7. Required fields should not be left blank.

8. Cleaning (for example, mapping NULL to 0, or gender Male to "M" and Female to "F", etc.).
9. Splitting a column into multiple columns and merging multiple columns into a single column.
10. Transposing rows and columns.
11. Using lookups to merge data.
12. Applying any complex data validation (e.g., if the first two columns in a row are empty, the row is automatically rejected from processing).
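Rules 8 and 12 above are easy to express in T-SQL. A minimal sketch, assuming hypothetical Amount, Gender, FirstName and LastName columns in Staging.Customers:

    SELECT ISNULL(Amount, 0) AS Amount,        -- map NULL to 0
           CASE Gender
               WHEN 'Male'   THEN 'M'          -- standardize gender codes
               WHEN 'Female' THEN 'F'
           END AS GenderCode
    FROM Staging.Customers
    -- complex validation: reject rows whose first two columns are empty
    WHERE NOT (FirstName IS NULL AND LastName IS NULL);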

Step 3) Loading
Loading data into the target data warehouse database is the last step of the ETL process. In a typical Data Warehouse, a huge volume of data needs to be loaded in a relatively short period (overnight). Hence, the load process should be optimized for performance.
In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity. Data Warehouse admins need to monitor, resume or cancel loads according to prevailing server performance.
Types of Loading:
1. Initial Load: populating all the Data Warehouse tables.
2. Incremental Load: applying ongoing changes periodically, as and when needed.
3. Full Refresh: erasing the contents of one or more tables and reloading them with fresh data.
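Each load type has a natural T-SQL counterpart. A hedged sketch against the same hypothetical schema (DW.FactSales as the target, Staging.Orders as the source):

    -- Initial Load / Full Refresh: erase the table and repopulate it
    TRUNCATE TABLE DW.FactSales;
    INSERT INTO DW.FactSales
    SELECT * FROM Staging.Orders;

    -- Incremental Load: apply only the ongoing changes
    MERGE DW.FactSales AS tgt
    USING Staging.Orders AS src
        ON tgt.OrderID = src.OrderID
    WHEN MATCHED THEN
        UPDATE SET tgt.Amount = src.Amount
    WHEN NOT MATCHED THEN
        INSERT (OrderID, OrderDate, Amount)
        VALUES (src.OrderID, src.OrderDate, src.Amount);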


In most organizations, data goes through an ETL (extract, transform and load) process before it is available for reporting. During the ETL process, data is extracted from a data source, then transformed, validated, standardized, corrected, quality checked and ultimately loaded into a data repository, such as a data mart or data warehouse, where it is streamlined for analysis and reporting. (https://agilethought.com/blogs/extract-transform-load-data-power-bi/)
1. Open Power BI desktop.
2. Go to the Home ribbon -> Get Data -> Excel and browse to the Excel file (URL: https://catalog.data.gov/dataset/general-aviation-2013).
3. In the Navigator tab, select the table (Data_GA.xls) from the dataset (Annual Review.xlsx).
4. Click on the Edit button.
5. The Power Query Editor window will open.
6. Reduce -> Remove Rows -> Remove Top Rows.
7. Enter the number of rows you want to remove (here it is just one row from the top).
8. Transform -> Use First Row as Headers.
9. Your table now has appropriate headers.
10. Right-click the columns you don't want to use (for example, longitude) -> Remove Columns.
11. To replace values, select the cell -> right-click (null) -> Replace Values....
12. Provide the new value (0) and click OK.
13. All occurrences of the value you chose to replace will be replaced with the new value.


Practical 3: Create the Data staging area for the selected database (Cube)

1. A staging area, or landing zone, is an intermediate storage area used for data processing
during the extract, transform and load (ETL) process. The data staging area sits between
the data source(s) and the data target(s), which are often data warehouses, data marts, or
other data repositories.
2. Data staging areas are often transient in nature, with their contents being erased prior to
running an ETL process or immediately following successful completion of an ETL
process. There are staging area architectures, however, which are designed to hold data
for extended periods of time for archival or troubleshooting purposes.
3. Imagine you have collected data from multiple sources. You now need to do some processing on the data: extract, transform, validate, clean, etc.
4. This intermediate process is called data staging. Once all of this is completed, you can load the data into a data warehouse.
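In SQL Server, a staging area is often nothing more than a separate schema of loosely-typed tables that are truncated before each run. A minimal sketch; the Staging schema and Orders feed are hypothetical:

    CREATE SCHEMA Staging;
    GO

    -- Wide, constraint-free landing table, so loads never fail mid-flight
    CREATE TABLE Staging.Orders (
        OrderID      int,
        CustomerName varchar(200),
        OrderDate    datetime,
        Amount       decimal(18, 2)
    );
    GO

    -- Transient staging: wipe the landing zone before every ETL run
    TRUNCATE TABLE Staging.Orders;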


1. Click File -> New -> Project -> Business Intelligence Projects -> select
Analysis Services Project -> Assign Project Name -> Click OK.
2. In Solution Explorer, Right click on Data Source -> Click New Data Source.
3. Click on Next.
4. Creating New Connection
5. Click on Test Connection and verify for its success.
6. Select Connection created in Data Connection -> Click Next.
7. Select Option Inherit.
8. Assign Data Source Name (Sales DW DS) -> Click Finish.
9. In the Solution Explorer, Right Click on Data Source View -> Click on New Data
Source View.
10. Click Next.
11. Select Relational Data Source we have created previously (Sales DW DS) -> Click
Next.
12. First, move your fact table (FactProductSales) to the right side to include it in the object list.
13. Select the fact table in the right pane (FactProductSales) -> Click on Add Related Tables -> Click Next.
14. Assign Name (Sales DW DSV) -> Click Finish.
15. Now Data Source View is ready to use.
16. In Solution Explorer -> Right Click on Cube -> Click New Cube.
17. Click Next.
18. Select Option Use existing Tables -> Click Next.
19. Select Fact Table Name from Measure Group (FactProductSales) -> Click Next.
20. Choose Measures from the List which you want to place in your Cube -> Click Next.
21. Select All Dimensions here which are associated with your Fact Table -> Click Next.
22. Assign Cube Name (Sales DW CUBE) -> Click Finish.
23. Now your Cube is ready, you can see the newly created cube and dimensions added
in your solution explorer.
24. In Solution Explorer, double-click the Dim Product dimension -> drag and drop Product Name from the table in the Data Source View and add it to the Attributes pane at the left side.
25. Double-click the Dim Date dimension -> drag and drop fields from the table shown in the Data Source View to Attributes -> drag and drop attributes from the leftmost Attributes pane to the middle Hierarchy pane.


26. In Solution Explorer, right click on Project Name (Analysis Services Project1) ->
Click Properties.
27. In Configuration Properties, select Deployment -> assign your SQL Server instance name where Analysis Services is installed (GUFRAN\PCGUFRAN, i.e. Machine Name\Instance Name) -> choose Deployment Mode: Deploy All (as of now) -> select Processing Option: Do Not Process -> Click OK.
28. In Solution Explorer, right click on Project Name (AnalysisServicesProject) -> Click
Deploy.
29. Once deployment finishes, you can see the message Deployment Completed in the deployment properties.
30. In Solution Explorer, right click on Project Name (AnalysisServicesProject) -> Click
Process.
31. Click on Run Button to process the Cube.
32. Once processing is complete, you can see the status Process Succeeded -> Click Close to close both of the open processing windows, one after the other.
33. In Solution Explorer, right click on Cube Name (Sales DW CUBE) -> Click Browse.
34. Drag and drop measures into the Detail fields area, and drag and drop dimension attributes into the Row Fields or Column Fields areas.
SQL SERVER 2008 Advance (SQLEXPRADV_x64_ENU)
https://www.microsoft.com/en-us/download/details.aspx?id=30438
SQL SERVER 2012 Advance (SQLEXPRADV_x64_ENU)
https://www.microsoft.com/en-us/download/details.aspx?id=43351
Database File: https://www.codeproject.com/Articles/652108/Create-First-Data-WareHouse

Practical 4(a): Create the ETL map and set up the schedule for execution
1. What is OLAP?


Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data for business decisions. OLAP systems allow users to analyze database information from multiple database systems at one time. The primary objective is data analysis, not data processing.
2. What is OLTP?
Online Transaction Processing, known as OLTP, supports transaction-oriented applications in a 3-tier architecture. OLTP administers the day-to-day transactions of an organization. The primary objective is data processing, not data analysis.

Example of OLAP:
Any data warehouse system is an OLAP system. For example, a company might compare its mobile phone sales in September with sales in October, and then compare those results with sales at another location, which may be stored in a separate database.
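The contrast is easiest to see in the queries themselves. A hedged T-SQL sketch, with a hypothetical Sales table (SaleID, StoreID, SaleDate, Amount):

    -- OLTP: touch one transaction at a time
    UPDATE Sales SET Amount = 499.00 WHERE SaleID = 1001;

    -- OLAP-style analysis: September vs. October sales per store
    SELECT StoreID,
           SUM(CASE WHEN MONTH(SaleDate) = 9  THEN Amount END) AS SeptemberSales,
           SUM(CASE WHEN MONTH(SaleDate) = 10 THEN Amount END) AS OctoberSales
    FROM Sales
    GROUP BY StoreID;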

1. Open Power BI desktop.
2. Go to Home Ribbon -> Get Data -> Web.
3. Paste the link where your data resides (https://www.bankrate.com/retirement/best-and-worst-states-for-retirement/).
4. Click OK.
5. Connecting might take a while.
6. Select the table having your required data and click the Edit button.
7. Right-click on any column to change its type and select that type (Whole Number).
8. Drag and drop the columns onto the workspace to see a visual representation of the data.
9. Add Column -> Custom Column.
10. The Custom Column window will appear.
11. Rename the column and add the formula to calculate your new column's data.
12. Change the type of the newly calculated column to Whole Number for regularity in the data.
13. Remove irrelevant columns; you can select multiple columns by pressing the CTRL key.
14. Move the Removed Columns step upwards to keep the data consistent.
15. Due to the removal of columns, our newly calculated column now shows errors; modify the formula in the formula bar to remove the error.
16. Click the drop-down on a column -> Sort Ascending (to sort the data in ascending order).
17. Replace values if you want. Always agree to insert a step in between.


18. Open one more data file to merge it with our current data. Import from web again.
(https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations).
19. Select the appropriate table.
20. Make data relevant by adjusting its headers properly (Remove top 2 rows).
21. Reduce the data to avoid redundancy (Remove alternate rows).
22. Our data doesn't need the federal district; remove it by right-clicking the states column and unchecking federal district.
23. Select and remove unnecessary columns.
24. Rename the columns.
25. Combine -> Merge Queries -> Merge Queries as New.
26. Choose your tables to merge and press OK.
27. Here we just want to display the State Codes column from the merged table, so we click the drop-down and select only that column.
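Merge Queries is conceptually a relational join. For readers who think in SQL, a rough T-SQL equivalent of step 25, with hypothetical Retirement and StateCodes tables joined on the state name:

    SELECT r.*, s.StateCode
    FROM Retirement r
    LEFT JOIN StateCodes s          -- Power BI's default merge kind: left outer
        ON r.StateName = s.StateName;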

Practical 4(b): Execute MDX queries to extract the data from the data warehouse
1. Open SQL Server Management Studio and connect to the Analysis Services server.
2. Click on New Query and type the following queries based on Sales_DW, clicking Execute after each one:
3. SELECT [Measures].[Sales Time Alt Key] ON COLUMNS FROM [Sales DW]
4. SELECT [Measures].[Quantity] ON COLUMNS FROM [Sales DW]
5. SELECT [Measures].[Sales Invoice Number] ON COLUMNS FROM [Sales DW]
6. SELECT [Measures].[Sales Total Cost] ON COLUMNS FROM [Sales DW]
7. SELECT [Measures].[Sales Total Cost] ON COLUMNS, [Dim Date].[Year].[Year] ON ROWS FROM [Sales DW]
8. SELECT [Measures].[Sales Total Cost] ON COLUMNS, NONEMPTY({[Dim Date].[Year].[Year]}) ON ROWS FROM [Sales DW]


9. SELECT [Measures].[Sales Total Cost] ON COLUMNS FROM [Sales DW] WHERE [Dim Date].[Year].[Year].&[2013]

Practical 5(a): Import the data warehouse data into Microsoft Excel and create the Pivot Table and Pivot Chart
1. Open a blank workbook. Click Data -> Get External Data -> From Access.
2. Select the OlympicMedals.accdb file and click Open. (http://howtomicrosoftofficetutorials.blogspot.com/2017/08/tutorial-import-data-into-excel-and.html)
3. Check the Enable Selection of Multiple Tables box and select all the tables. Click OK.
4. The Import Data window appears. Select the Pivot Table Report option and click OK.
5. A pivot table is created using the imported tables.
6. In PivotTable Fields, expand the Medals table. Find the NOC_CountryRegion field and drag it to the COLUMNS area.
7. Expand the Disciplines table. Find the Discipline field and drag it to the ROWS area.
8. Filter the disciplines to display only five sports: archery, diving, fencing, figure skating and speed skating. Click anywhere in the PivotTable to ensure the Excel PivotTable is selected. In the PivotTable Fields list, where the Disciplines table is expanded, hover over its Discipline field and a drop-down arrow appears to the right of the field. Click the drop-down, click (Select All) to remove all selections, then scroll down and select Archery, Diving, Fencing, Figure Skating, and Speed Skating. Click OK.
9. In PivotTable Fields, from the Medals table, drag Medal to the VALUES area. Since
Values must be numeric, Excel automatically changes Medal to Count of Medal.
10. From the Medals table, select Medal again and drag it into the FILTERS area
11. Let’s filter the PivotTable to display only those countries or regions with more than
90 total medals. In the PivotTable, click the dropdown to the right of Column Labels.
12. Select Value Filters and select Greater Than….
13. Type 90 in the last field (on the right). Click OK.
14. Your PivotTable looks like the following screen.
15. Let’s start by creating a blank worksheet, then import data from an Excel workbook.
Insert a new Excel worksheet, and name it Sports.
16. Browse to the folder that contains the downloaded sample data files, and open
OlympicSports.xlsx.


17. Select and copy the data in Sheet1. If you select a cell with data, such as cell A1, you
can press Ctrl + A to select all adjacent data. Close the OlympicSports.xlsx workbook. On
the Sports worksheet, place your cursor in cell A1 and paste the data.
18. With the data still highlighted, press Ctrl + T to format the data as a table. You can
also format the data as a table from the ribbon by selecting HOME -> Format as Table.
Since the data has headers, select My table has headers in the Create Table window that
appears.
19. Name the table. In TABLE TOOLS -> DESIGN -> Properties, locate the Table Name
field and type Sports. The workbook looks like the following screen. Save the workbook.
20. Insert a new Excel worksheet, and name it Hosts.
21. Select and copy the following table, including the table headers.
22. In Excel, place your cursor in cell A1 of the Hosts worksheet and paste the data.
23. Format the data as a table. As described earlier in this tutorial, you press Ctrl + T to
format the data as a table, or from HOME -> Format as Table. Since the data has headers,
select My table has headers in the Create Table window that appears.
24. Name the table. In TABLE TOOLS -> DESIGN -> Properties locate the Table Name
field, and type Hosts.
25. Select the Edition column, and from the HOME tab, format it as Number with 0
decimal places.
26. Save the workbook. Your workbook looks like the following screen.
27. On Sheet1, at the top of PivotTable Fields, click All to view the complete list of
available tables, as shown in the following screen.
28. Expand Sports and select Sport to add it to the PivotTable. Notice that Excel prompts
you to create a relationship, as seen in the following screen.
29. Click CREATE in the highlighted PivotTable Fields area to open the Create Relationship dialog, as shown in the following screen.
30. In Table, choose Disciplines from the drop down list. In Column (Foreign), choose
SportID. In Related Table, choose Sports. In Related Column (Primary), choose SportID.
Click OK.
31. In the ROWS area, move Sport above Discipline. That’s much better, and the
PivotTable displays the data how you want to see it, as shown in the following screen.

A pivot table is a statistics tool that summarizes and reorganizes selected columns and rows of data in a spreadsheet or database table to obtain a desired report. The tool does not actually change the spreadsheet or database itself; it simply "pivots", or turns, the data to view it from different perspectives.
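The same idea exists in T-SQL as the PIVOT operator. A minimal sketch, assuming a hypothetical Medals table with Sport, MedalYear and MedalCount columns:

    -- Turn rows of years into columns, one medal total per sport per year
    SELECT Sport, [2008], [2012]
    FROM (SELECT Sport, MedalYear, MedalCount FROM Medals) AS src
    PIVOT (SUM(MedalCount) FOR MedalYear IN ([2008], [2012])) AS p;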


Practical 5(b): Import the cube into Microsoft Excel and create the Pivot Table and Pivot Chart to perform data analysis

1. Go to File -> Options -> Add-Ins.
2. In the Manage box near the bottom, click COM Add-Ins -> Go.
3. Check the Microsoft Office Power Pivot in Microsoft Excel 2013 box, and then click OK.
4. The following data is displayed.
5. The Excel workbook includes a table called Hosts. We imported Hosts by copying it
and pasting it into Excel, then formatted the data as a table.
6. In Excel, click the Hosts tab to make it the active sheet. On the ribbon, select POWER
PIVOT -> Tables -> Add to Data Model. This step adds the Hosts table to the Data
Model.
7. The Power Pivot window shows all the tables in the model, including Hosts. Click through a couple of tables. In Power Pivot you can view all of the data that your model contains, even if it isn't displayed in any worksheet in Excel, such as the Disciplines, Events, and Medals data below, as well as S_Teams, W_Teams, and Sports.
8. In the Power Pivot window, in the View section, click Diagram View.
9. Use the slide bar to resize the diagram so that you can see all objects in the diagram.
Rearrange the tables by dragging their title bar, so they’re visible and positioned next to
one another. Notice that four tables are unrelated to the rest of the tables: Hosts, Events,
W_Teams, and S_Teams.
10. Notice that both the Medals table and the Events table have a field called
DisciplineEvent. Upon further inspection, you determine that the DisciplineEvent field in
the Events table consists of unique, non-repeated values.
11. Create a relationship between the Medals table and the Events table. While in
Diagram View, drag the DisciplineEvent field from the Events table to the
DisciplineEvent field in Medals. A line appears between them, indicating a relationship
has been established. Click the line that connects Events and Medals. The highlighted
fields define the relationship, as shown in the following screen.
12. To connect Hosts to the Data Model, we need a field with values that uniquely identify each row in the Hosts table. Then we search our Data Model to see if that same data exists in another table. Looking in Diagram View doesn't allow us to do this. With Hosts selected, switch back to Data View.
13. The following screen appears.


14. Select the Hosts table in Power Pivot. Adjacent to the existing columns is an empty
column titled Add Column. Power Pivot provides that column as a placeholder. There are
many ways to add a new column to a table in Power Pivot, one of which is to simply
select the empty column that has the title Add Column.
15. In the formula bar, type the following DAX formula: =CONCATENATE([Edition],[Season])
16. When you finish building the formula, press Enter to accept it. Values are populated
for all the rows in the calculated column.
17. Let’s rename the calculated column to EditionID. You can rename any column by
double-clicking it, or by right-clicking the column and choosing Rename Column. When
completed, the Hosts table in Power Pivot looks like the following screen.
18. Create a new column in the Medals table, like we did for Hosts. In Power Pivot select
the Medals table, and click Design -> Columns -> Add. Notice that Add Column is
selected.
19. In the formula bar above the table, type the following DAX formula: =YEAR([Edition])
20. When you finish building the formula, press Enter. Values are populated for all the rows in the calculated column, based on the formula you entered. Rename the column by right-clicking CalculatedColumn1 and selecting Rename Column. Type Year, and then press Enter.
21. Create the EditionID calculated column: select Add Column, and in the formula bar type the following DAX formula, then press Enter: =CONCATENATE([Year],[Season])
22. Sort the column in ascending order.
23. The Medals table in Power Pivot now looks like the following screen.
24. In the Power Pivot window, select Home -> View -> Diagram View from the ribbon.
25. Expand Hosts so you can view all of its fields.
26. Position the Hosts table so that it is next to Medals. Drag the EditionID column in
Medals to the EditionID column in Hosts. Power Pivot creates a relationship between the
tables based on the EditionID column, and draws a line between the two columns,
indicating the relationship.
27. Expand the Events table so that you can more easily see all of its fields. Press and hold Ctrl, and click the Sport, Discipline, and Event fields. With those three fields selected, right-click and select Create Hierarchy.
28. A parent hierarchy node, Hierarchy1, is created at the bottom of the table, and the selected columns are copied under the hierarchy as child nodes. Double-click the title, Hierarchy1, and type SDE to rename your new hierarchy.
29. Still in Diagram View in Power Pivot, select the Hosts table and click the Create
Hierarchy button in the table header.


30. An empty hierarchy parent node appears at the bottom of the table.
31. Type Locations as the name for your new hierarchy.
32. There are many ways to add columns to a hierarchy. Drag the Season, City and
NOC_CountryRegion fields onto the hierarchy name (in this case, Locations) until the
hierarchy name is highlighted, then release to add them.
33. Right-click EditionID and select Add to Hierarchy. Choose Locations.
34. Go back to Excel. In Sheet1, remove the fields from the ROWS area of PivotTable
Fields.
35. Remove all the fields from the COLUMNS area.
36. The only remaining fields in the PivotTable fields are Medal in the FILTERS area and
Count of Medal in the VALUES area.
37. From the PivotTable Fields area, drag SDE from the Events table to the ROWS area.
38. Then drag Locations from the Hosts table into the COLUMNS area. Just by dragging
those two hierarchies, your PivotTable is populated with a lot of data, all of which is
arranged in the hierarchy you defined in the previous steps.
39. Let’s filter that data a bit, and just see the first ten rows of events. In the PivotTable,
click the arrow in Row Labels, click (Select All) to remove all selections, then click the
boxes beside the first ten Sports.
40. Your PivotTable now looks like the following screen.
41. You can expand any of those Sports in the PivotTable. When we expand the Aquatics sport, we see all of its child discipline elements and their data. When we expand the Diving discipline under Aquatics, we see its child events too, as shown in the following screen. We can do the same for Water Polo, and see that it has only one event.
42. In the PivotTable Fields area, remove Locations from the COLUMNS area.
43. Your PivotTable will have the following screen.
44. Then remove SDE from the ROWS area. You’re back to a basic PivotTable.
45. Your PivotTable will have the following screen.
46. From the Hosts table, drag Season, City, NOC_CountryRegion, and EditionID into
the COLUMNS area, and arrange them in that order, from top to bottom.
47. From the Events table, drag Sport, Discipline, and Event into the ROWS area, and
arrange them in that order, from top to bottom.
48. In the PivotTable, filter Row Labels to the top ten Sports.
49. Collapse all the rows and columns.
50. Expand Aquatics, then Diving and Water Polo. Your workbook looks like the
following screen.

