
Urban Analytics

Lab 3 – Data Engineering


Miguel de Castro Neto
mneto@novaims.unl.pt

Spring Semester 2018/2019

Instituto Superior de Estatística e Gestão de Informação


Universidade Nova de Lisboa
Objectives for this Lab
1. Understand the importance of Data Integration and Data Engineering in the
overall Business Intelligence process.

2. Understand the importance of the Data Quality Process in the overall
Business Intelligence process.

3. Understand the main Power BI capabilities for data loading, data transformation and
data formatting, from a Business Analyst perspective.

4. Learn, by testing several practice examples, the main data transformation
capabilities in Power BI.

Extraction, Transformation, and Load (ETL) Overview
Extraction, transformation, and load (ETL), in the context of data warehousing:
• Extraction: selecting data from one or more sources and reading the selected data
• Transformation: converting data from their original form to whatever form the DW needs. This step often
also includes cleansing of the data to remove as many errors as possible.
• Loading: putting the converted (transformed) data into the DW

[Figure: ETL data flow. Sources: packaged application, legacy system, internal applications. Steps: Extract, Transform, Cleanse, Load, with a staging database. Targets: data warehouse, data mart.]

Common Activities in the Data Integration process
Data integration relies on pieces of code that take data from a source and convert it
into data the data warehouse can use for analysis:

• Data Loading – read and convert from data sources

• Data Transformations – join, aggregate, filter, convert data

• Data de-duplication – finds multiple records referring to the same entity, merges
them

• Data Profiling – builds tables, histograms, etc. to summarize data

• Data Quality – test against master values, known business rules, constraints, etc.
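
As a rough illustration, the activities above can be sketched in Power Query M on a small inline table. The table contents, the column names and the business rule (Sales must not be negative) are invented for this sketch. Table.Distinct only drops exact duplicate rows, which is a much simpler case than merging records that refer to the same entity.

```powerquery
let
    // Data loading: a hypothetical inline source standing in for a real file or database
    Source = #table(
        {"StoreKey", "City", "Sales"},
        {
            {1, "Lisboa", "1200"},
            {2, "Porto", "-50"},
            {1, "Lisboa", "1200"},
            {3, "Faro", "800"}
        }
    ),
    // Data transformation: convert the columns to proper types (Sales arrives as text)
    Typed = Table.TransformColumnTypes(
        Source,
        {{"StoreKey", Int64.Type}, {"City", type text}, {"Sales", type number}}
    ),
    // De-duplication: drop rows that are exact duplicates of another row
    Deduplicated = Table.Distinct(Typed),
    // Data profiling: one summary row per column (min, max, count, null count, ...)
    Profile = Table.Profile(Deduplicated),
    // Data quality: list the rows that break the assumed rule "Sales >= 0"
    RuleViolations = Table.SelectRows(Deduplicated, each [Sales] < 0)
in
    [Clean = Deduplicated, Profile = Profile, Violations = RuleViolations]
```

The query returns a record so that the cleaned table, the profile and the rule violations can each be inspected by clicking the corresponding field.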

Power BI – Basic Transformations
[Figure: Power Query Editor screenshot. Ribbon groups: Data Loading, Rows and Columns Management, Basic Transformations, Combining Data. Panes: Queries, Transformations, Data Preview, Query Properties.]

Power BI – Advanced Transformations
[Figure: Power Query Editor ribbon screenshot. Groups: Grouping & Transpose, Columns & Unpivot, Text & Formatting, Math & Statistics, Date & Time.]

Power BI – Extending with M Language

M is the informal name of the Power Query Formula Language, used in Power BI and
Excel. The M stands for Data Mashup.

The syntax of the language is simple. A query typically has two blocks: a let
expression block and an in expression block.

A line of M code continues onto the following lines until you add the end-of-line
character, which is the comma that separates the steps of a let block.

A variable name can be a single word, such as Source, or it can contain spaces.
If the name contains characters such as spaces, put it inside double quotes (")
and prefix it with a hash sign (#), for example #"Changed Type".

To invoke a function, write its name and specify its parameters.

Power Query (“M”) Formula Language
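
The syntax rules above can be seen together in a minimal sketch. The table and its values are made up for illustration; the point is only the let/in structure, the comma that ends each step, the quoted step name and the function call.

```powerquery
let
    // Each step ends with a comma; a step may span several lines until that comma
    Source = #table(
        {"Album", "Price"},
        {{"Album A", 10.90}, {"Album B", 9.90}}
    ),
    // A name with spaces goes inside double quotes with a leading #
    #"Rounded Price" = Table.TransformColumns(
        Source,
        {{"Price", each Number.Round(_, 0), type number}}
    ),
    // Invoking a function: its name followed by the parameters in parentheses
    RowCount = Table.RowCount(#"Rounded Price")
in
    // The in block names the value the whole query returns
    RowCount
```

The last step before in has no trailing comma, and the query evaluates to whatever expression follows in, here the number of rows in the table.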

A real-world example of an M transformation

Exercise
Exploring Data Transformations

Exercise – Exploring Data Transformations
1. [New Project] Load CD information from the CD Catalog XML file*. Make sure that all data types and the
data structure are correct. Then count the rows, but remove that transformation afterwards. Next, group the
albums by country to get the number of albums and the average price per country.

2. [New Project] Now, let’s import all CSV files from the Stores folder* by selecting the Combine & Edit button.
Start by removing the first column, which holds the source filename, and use the first row as the column headers.
Then, to remove the header row repeated from each file, filter the StoreType column to remove the rows whose
value is StoreType. To prepare the sales analysis, select the 6 columns with the yearly sales (2008 to 2013) and
unpivot them into Date and Value columns. Change the type of the Date column to Date and convert the Value
column to thousands.

3. Now load data from DIMStores (Contoso Access Database)*. Then select the stores table previously imported
from the folder and merge both queries, using GeographyKey as the relationship key.
Once merged, select all location attributes except the GeographyKey column.

4. [New Project] Finally, let’s try the new “Add a column from an example” Power BI feature, using the list of states
and territories of the United States
(https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States), to create a new column
with the states and another with the month of establishment (3 letters only).
* Lab 2 Files
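
The grouping, unpivot and merge steps in these exercises are done through the ribbon, but each one generates M code along the following lines. This is only a sketch: the inline tables, column names and values are hypothetical stand-ins for the lab files, and turning a year into a date as 1 January of that year is an assumption of this sketch.

```powerquery
let
    // Hypothetical stand-in for the CD catalog after loading and typing
    Albums = #table(
        type table [Country = text, Price = number],
        {{"USA", 10.9}, {"UK", 9.9}, {"USA", 8.1}}
    ),
    // Group By: number of albums and average price per country
    ByCountry = Table.Group(
        Albums,
        {"Country"},
        {
            {"Albums", each Table.RowCount(_), Int64.Type},
            {"Average Price", each List.Average([Price]), type number}
        }
    ),

    // Hypothetical stand-in for the combined store files, one column per year
    Stores = #table(
        {"StoreName", "GeographyKey", "2008", "2009"},
        {{"Store A", 1, 1500000, 1750000}, {"Store B", 2, 900000, 1100000}}
    ),
    // Unpivot the year columns into Date (former column name) and Value columns
    Unpivoted = Table.Unpivot(Stores, {"2008", "2009"}, "Date", "Value"),
    // Turn the year text into a date (assumed here: 1 January of that year)
    WithDates = Table.TransformColumns(
        Unpivoted,
        {{"Date", each #date(Number.FromText(_), 1, 1), type date}}
    ),
    // Express Value in thousands
    InThousands = Table.TransformColumns(WithDates, {{"Value", each _ / 1000, type number}}),

    // Hypothetical stand-in for the location attributes from the Access database
    Geography = #table(
        {"GeographyKey", "City"},
        {{1, "Seattle"}, {2, "Berlin"}}
    ),
    // Merge Queries: join on GeographyKey, then expand only the location attributes
    Merged = Table.NestedJoin(
        InThousands, {"GeographyKey"}, Geography, {"GeographyKey"}, "Geo", JoinKind.LeftOuter
    ),
    Expanded = Table.ExpandTableColumn(Merged, "Geo", {"City"})
in
    [ByCountry = ByCountry, StoreSales = Expanded]
```

Comparing this sketch with the code shown in the Advanced Editor after completing each exercise is a quick way to check which function each ribbon button produced.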
Exercise – Exploring Data Transformations
1. [New Project] Load all data from the Financial Sample Excel file. Check that everything is OK. Then create a
new column to store the percentage of discount (Discounts/Sales). Also add a conditional column to segment
sales above 100,000 as High, above 50,000 as Medium and the remaining ones as Low. Finally, remove the
Discounts column.

2. Now let’s work on the Date column. Duplicate this column twice and calculate the quarter number and the
name of the day. Also check the earliest date in the Date column.

3. Let’s now look at the M language in more detail and use it to create a function that builds a new date table
and receives the start date and the end date as parameters.

4. To do that, copy and paste the code below into the Advanced Editor in Power BI [New Source > Blank Query >
Advanced Editor]. Then define the start and end dates and invoke the function.

5. To visualize all data sources and how they are transformed, select the Query Dependencies button and check
all relations.

Code for step 4:

    //Create Date Dimension
    (StartDate as date, EndDate as date)=>
    let
        //Capture the date range from the parameters
        StartDate = #date(Date.Year(StartDate), Date.Month(StartDate), Date.Day(StartDate)),
        EndDate = #date(Date.Year(EndDate), Date.Month(EndDate), Date.Day(EndDate)),
        //Get the number of dates that will be required for the table
        //(as written, the list ends the day before EndDate; add 1 to include EndDate)
        GetDateCount = Duration.Days(EndDate - StartDate),
        //Take the count of dates and turn it into a list of dates
        GetDateList = List.Dates(StartDate, GetDateCount, #duration(1,0,0,0)),
        //Convert the list into a table
        DateListToTable = Table.FromList(GetDateList, Splitter.SplitByNothing(), {"Date"}, null, ExtraValues.Error),
        //Create various date attributes from the date column:
        //Add Year Column
        YearNumber = Table.AddColumn(DateListToTable, "Year", each Date.Year([Date])),
        //Add Quarter Column
        QuarterNumber = Table.AddColumn(YearNumber, "Quarter", each "Q" & Number.ToText(Date.QuarterOfYear([Date]))),
        //Add Week Number Column
        WeekNumber = Table.AddColumn(QuarterNumber, "Week Number", each Date.WeekOfYear([Date])),
        //Add Month Number Column
        MonthNumber = Table.AddColumn(WeekNumber, "Month Number", each Date.Month([Date])),
        //Add Month Name Column
        MonthName = Table.AddColumn(MonthNumber, "Month", each Date.ToText([Date],"MMMM")),
        //Add Day of Week Column
        DayOfWeek = Table.AddColumn(MonthName, "Day of Week", each Date.ToText([Date],"dddd"))
    in
        DayOfWeek
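
Besides the Invoke button, a function can also be called from another query by writing its name and the two dates. The sketch below defines a shortened, hypothetical CreateDates function inline so that it runs by itself in a blank query; the function name and the sample dates are assumptions. Unlike the plain day difference, this sketch adds 1 to the day count so that the end date is included.

```powerquery
let
    // Hypothetical, shortened date-table function (illustration only)
    CreateDates = (StartDate as date, EndDate as date) as table =>
        let
            // +1 so that EndDate itself is part of the list
            DayCount = Duration.Days(EndDate - StartDate) + 1,
            // One date per day, starting at StartDate
            Dates = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
            // Convert the list into a one-column table
            AsTable = Table.FromList(Dates, Splitter.SplitByNothing(), {"Date"})
        in
            AsTable,
    // Invoke the function with literal start and end dates
    Result = CreateDates(#date(2019, 1, 1), #date(2019, 1, 7))
in
    Result
```

With these sample dates the result is a one-column table holding the seven dates from 1 to 7 January 2019.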
Thank You!
