
Business Intelligence I

Δαμιανός Χατζηαντωνίου (damianos@aueb.gr)


Department of Management Science and Technology
Athens University of Economics and Business
Topics
 Basic Concepts
 Extract-Transform-Loading (ETL)
 Data Warehouse Modeling
 Implementation
 Queries over Data Warehouses
 Best Practices

Data Management, Business Intelligence and Visualization 2


Basic Concepts
Business Intelligence, Definition (1)

“Business intelligence (BI ) software is a


collection of decision support technologies for
the enterprise, aimed at enabling knowledge
workers such as executives, managers, and
analysts to make better and faster decisions.”



Business Intelligence, Definition (2)
 [Barry Devlin] A single, complete and consistent store
of data obtained from a variety of different sources,
made available to end users in a way they can
understand and use in a business context.
 [Bill Inmon] A data warehouse is a subject-oriented,
integrated, time-variant, non-volatile collection of
data that is used primarily in organizational decision
making.



Business Intelligence – Aggregation
 Which are our lowest/highest margin customers?
 Who are my customers and what products are they buying?
 What is the most effective distribution channel?
 What product promotions have the biggest impact on revenue?
 Which customers are most likely to go to the competition?
 What impact will new products/services have on revenue and margins?



Data Analysis in SQL Terms
 What is Data Analysis?
 The difference between:
SELECT DeptCode, Name, Salary
FROM Employees
ORDER BY DeptCode
→ retrieve information

SELECT DeptCode, avg(Salary)
FROM Employees
GROUP BY DeptCode
→ extract features
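Run end to end, the contrast looks like this: a minimal SQLite sketch with made-up Employees data (department codes, names and salaries are invented for illustration):

```python
import sqlite3

# In-memory database with a hypothetical Employees table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (DeptCode TEXT, Name TEXT, Salary REAL)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [("D1", "Alice", 1000), ("D1", "Bob", 1200), ("D2", "Carol", 900)],
)

# Retrieve information: every detail row, merely ordered.
rows = conn.execute(
    "SELECT DeptCode, Name, Salary FROM Employees ORDER BY DeptCode"
).fetchall()

# Extract features: one summary value per department.
features = conn.execute(
    "SELECT DeptCode, avg(Salary) FROM Employees GROUP BY DeptCode"
).fetchall()

print(rows)      # three detail rows
print(features)  # average salary per department
```

The first query returns as many rows as the table holds; the second collapses them into one feature per group, which is the basic move of analytical querying.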



Importance of Grouping and Aggregation

Forming one or more data sets

Extracting features from these

Assessment



“Traditional” DBMS (OLTP)
 Used to run the business in real time
 Based on up-to-the-second data
 Optimized to handle large numbers of simple
read/write transactions (normalization)
 Optimized for fast response to transactions
 Used by people who deal with customers,
products, accounts (clerks, salespeople, etc.)
 They are increasingly used by customers (web)



Traditional DBMS “Problems”
 I can’t find the data I need
 data is scattered over the network
 many versions, subtle differences – in multiple systems
 I can’t get the data I need
 need an expert to get the data – in legacy systems
 I can’t understand the data I found
 available data poorly documented – gender has three values
 I can’t use the data I found
 results are unexpected – null values where they shouldn’t be
 data needs to be transformed from one form to other – $ vs. €



Challenges

 Data Quality – Data must be cleaned


 Integration – Data must be integrated
 Functionality – Ability to express complex
analytical queries easily
 Performance – results must come back fast
“A new generation of data management
systems, aimed toward data analysis”



Analysis Queries and Production Systems
 Why not use the existing (production/operational)
databases for business analysis?
 In the past five years, which product has been the most profitable?
 On which public holiday do we have the largest sales?
 In which week do we have the largest sales?
 Do the sales of dairy products increase over time?
 Such queries are difficult to express in SQL
 Running them would kill performance in production systems
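Even the simplest of these, "in which week do we have the largest sales?", already needs a calendar function, an aggregate, a sort and a cutoff; a sketch in SQLite with a hypothetical Sales table (dates and amounts are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (SaleDate TEXT, Amount REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?)",
    [("2023-01-02", 100), ("2023-01-03", 50), ("2023-12-20", 300)],
)

# Week with the largest sales: strftime to bucket by week, sum per
# bucket, sort descending, keep the top bucket. Holiday calendars or
# multi-year trend questions need far more machinery than this.
week, total = conn.execute(
    """
    SELECT strftime('%Y-%W', SaleDate) AS week, sum(Amount) AS total
    FROM Sales
    GROUP BY week
    ORDER BY total DESC
    LIMIT 1
    """
).fetchone()
print(week, total)
```

This runs, but on a production OLTP system such a full-table scan competes with live transactions, which is exactly the motivation for a separate warehouse.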



Why Not Operational Databases (OLTP)?
                     Operational DBMS                  Data Warehouses
users                clerk, IT professional            knowledge worker
function             day-to-day operations             decision support
DB design            application-oriented              subject-oriented
data                 current, up-to-date;              historical, summarized;
                     detailed, flat relational;        multidimensional;
                     isolated                          integrated, consolidated
usage                repetitive                        ad hoc
access               read/write,                       lots of scans
                     index/hash on primary key
unit of work         short, simple transaction         complex query
# records accessed   tens                              millions
# users              thousands                         hundreds
DB size              MB/GB                             GB/TB
metric               transaction throughput            query throughput, response time



Data Warehouses – Big Picture
Operational DBs and external sources feed the Data Warehouse through Extract, Transform, Load and Refresh utilities. The warehouse and its Data Marts serve OLAP servers, which in turn support Analysis, Query/Reporting and Data Mining tools. A Metadata Repository and Monitoring & Administration tools sit alongside the whole pipeline. © ACM SIGMOD Record



Building a Data Warehouse

Source Systems → Data Warehouse → Data Marts and cubes → Clients (query tools, reporting, analysis, data mining)

1. Design of Data Warehouse
2. Extract, Transform, Loading (ETL)
3. Creation of OLAP Cubes
4. Data Analysis
Data Integration
 Data integration involves combining data
residing in different sources and providing
users with a unified view of them

 (a) Data warehousing approach: move data from the data sources into a new schema, a new database.
 (b) Mediators: leave the data at the data sources and build a mediated schema, a virtual database with mappings to the data sources.



Extract-Transform-Loading (ETL)
ETL Process
Step 2 of building the data warehouse: data is moved from the Source Systems through Extract, Transform, Loading (ETL) into the Data Warehouse, and from there into the data marts/cubes that serve the client tools (query, reporting, analysis, data mining).
ETL – Data Extraction
 A plethora of data sources (systems + formats)
 Running in different hardware, operating systems and networks
(communication protocols), encoding, APIs, web services, third
party, POS, ATM, Call switches (internal formats)
 Legacy systems (e.g. mainframes, file systems, COBOL, 3GL,
4GL), different relational databases (older and newer), JSON,
XML, spreadsheets, csv files, GIS data
 Unstructured data such as text, images, video, etc.
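One common pattern for coping with this plethora of formats is a per-source extractor that maps every feed onto a single common record shape; a minimal stdlib sketch with made-up CSV and JSON feeds (all field names are assumptions):

```python
import csv
import io
import json

# Hypothetical feeds: a CSV export and a JSON API payload describing
# the same kind of entity with different field names.
csv_feed = "cust_id,name\n1,Alice\n2,Bob\n"
json_feed = '[{"customerId": 3, "fullName": "Carol"}]'

def extract_csv(text):
    # One extractor per source; each yields the same record shape.
    for row in csv.DictReader(io.StringIO(text)):
        yield {"id": int(row["cust_id"]), "name": row["name"]}

def extract_json(text):
    for obj in json.loads(text):
        yield {"id": obj["customerId"], "name": obj["fullName"]}

records = list(extract_csv(csv_feed)) + list(extract_json(json_feed))
print(records)
```

Everything downstream of the extractors then only ever sees the unified shape, no matter how many source systems are added.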



Issues in Data Extraction
 How do you extract data – or can you extract data? Do
you know what to extract? Who has data ownership?
 Where do you put the extracts, full/partial extraction,
how often, performance in production systems

Source Systems → Staging Area → Data Warehouse


Validations during Extraction
 Some validations are done during Extraction:
 Reconcile records with the source data
 Make sure that no spam/unwanted data is loaded
 Check data types
 Remove all types of duplicate/fragmented data
 Check whether all the keys are in place
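The checks above can be sketched as a single validation pass; the record shape (an integer "id" key) is a hypothetical stand-in for whatever key the source actually carries:

```python
def validate(records):
    """Split extracted records into clean and rejected lists."""
    seen, clean, rejected = set(), [], []
    for rec in records:
        if "id" not in rec or not isinstance(rec["id"], int):
            rejected.append(rec)   # key missing, or wrong data type
        elif rec["id"] in seen:
            rejected.append(rec)   # duplicate record
        else:
            seen.add(rec["id"])
            clean.append(rec)
    return clean, rejected

clean, rejected = validate(
    [{"id": 1}, {"id": 1}, {"id": "x"}, {"name": "no key"}, {"id": 2}]
)
print(len(clean), len(rejected))  # 2 3
```

In practice the rejected list is kept (not discarded) so it can be reconciled against the source later.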



ETL – Data Transformation
 Creating a data warehouse is just a matter of extracting
operational/external data and entering it into the warehouse…
 Nothing could be farther from the truth:
 Data comes from disparate, questionable sources
 Legacy systems that are no longer documented
 Outside sources with questionable quality procedures
 Production systems with no built-in integrity checks
 Errors, anomalies, mismatches
 In this step, you apply functions to the extracted data
 Data quality is of utmost importance in data warehousing



Data Transformation – Typical Problems
 Different spellings of the same name (e.g. Jon, John)
 Use of different names (e.g. Athens, Athina, Atene)
 Multiple ways to denote company name (e.g. Google,
Google Inc.) [entity resolution]
 Different account numbers generated by various
applications/departments for the same customer
 Required fields remain blank (e.g. date of birth)
[missing values]
 Invalid product codes collected [inconsistencies]
 e.g. when manual entry is permitted, it can lead to mistakes



Data Transformation – Example (1)



Data Transformation – Example (2)



Data Transformation – Typical Tasks (1)
 Filtering – Select only certain columns to load
 Using rules and lookup tables for data standardization
 Character Set Conversion and encoding handling
 Conversion of measurements units, date/time/currency
 Data value validation checks, e.g. age > 18
 Required fields should not be left blank
 how to handle missing values is a whole topic on its own,
involving statistical techniques and/or business decisions
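A few of these tasks in one sketch; the city lookup table and the exchange rate below are invented for illustration (real pipelines load both from reference data):

```python
# Hypothetical standardization lookup table and a fixed demo rate.
CITY_LOOKUP = {"Athina": "Athens", "Atene": "Athens"}
USD_PER_EUR = 1.10  # assumed rate, for illustration only

def transform(rec):
    rec = dict(rec)
    # Standardization via lookup table.
    rec["city"] = CITY_LOOKUP.get(rec["city"], rec["city"])
    # Currency conversion to one warehouse-wide unit.
    if rec["currency"] == "EUR":
        rec["amount"] = round(rec["amount"] * USD_PER_EUR, 2)
        rec["currency"] = "USD"
    # Data value validation check.
    rec["valid"] = rec["age"] > 18
    return rec

out = transform({"city": "Athina", "currency": "EUR",
                 "amount": 100.0, "age": 17})
print(out)
```

Each rule is trivial on its own; the ETL difficulty is that a real warehouse needs hundreds of such rules, each tied to a specific source.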



Data Transformation – Typical Tasks (2)
 Cleaning, for example:
 mapping ‘null’ to 0 (handling null values is itself a whole topic)
 mapping ‘Male’ to ‘M’ and ‘Female’ to ‘F’
 Splitting a column into multiple columns and merging
multiple columns into a single column
 Transposing rows and columns
 Using lookups in other tables/data to merge data
 Applying any complex data validation rule
 e.g. if the first two columns in a row are empty,
automatically reject the row from processing
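The cleaning tasks above, sketched on a hypothetical row format (the field names are assumptions):

```python
def clean_row(row):
    row = dict(row)
    # Map null to 0.
    row["sales"] = 0 if row["sales"] is None else row["sales"]
    # Map 'Male'/'Female' to 'M'/'F'.
    row["gender"] = {"Male": "M", "Female": "F"}.get(row["gender"],
                                                     row["gender"])
    # Split one column into two.
    first, _, last = row.pop("full_name").partition(" ")
    row["first_name"], row["last_name"] = first, last
    return row

def reject(row):
    # Complex validation rule: reject if the first two fields are empty.
    values = list(row.values())
    return values[0] in ("", None) and values[1] in ("", None)

row = clean_row({"sales": None, "gender": "Male",
                 "full_name": "Ada Lovelace"})
print(row)
print(reject({"a": "", "b": None, "c": 1}))  # True
```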



Data Transformation – Advanced Topics
 Close relationship to data engineering tasks in data
science projects
 Web scraping
 e.g. Python + Beautiful Soup library + Selenium
 Entity resolution
 e.g. Python + dedupe library, string distance functions, LiveRamp
 Missing values
 choose one value randomly, choose from a distribution, use a
model to predict what is missing, multiple imputation, R/Python
 Information extraction



ETL – Data Loading
 Data load and refresh utilities are responsible for
moving transformed data from operational databases
and external sources into the data warehouse quickly
and with as little performance impact as possible at both
ends.
 Types of Loading
 Initial Load: populating all the data warehouse tables
 Incremental Load: applying ongoing changes in data sources
periodically (during refreshing)
 Full Refresh: erase the contents of one or more tables and
reload with fresh data
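The three loading types, sketched against an in-memory SQLite table (the table and column names are made up; the upsert syntax needs SQLite 3.24+):

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, amount REAL)")

# Initial load: populate the empty warehouse table in bulk.
dw.executemany("INSERT INTO fact_sales VALUES (?, ?)",
               [(1, 10.0), (2, 20.0)])

# Incremental load: apply only the changes since the last refresh;
# an upsert updates changed rows and inserts new ones.
changes = [(2, 25.0), (3, 30.0)]
dw.executemany(
    "INSERT INTO fact_sales VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
    changes,
)

# Full refresh: erase the table contents and reload from scratch.
def full_refresh(conn, fresh_rows):
    conn.execute("DELETE FROM fact_sales")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", fresh_rows)

rows = dw.execute("SELECT * FROM fact_sales ORDER BY id").fetchall()
print(rows)  # [(1, 10.0), (2, 25.0), (3, 30.0)]
```

Real warehouses replace the row-at-a-time inserts with the DBMS's bulk-load utility, but the three strategies are the same.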



Issues in Data Loading
 Huge volumes of data to be loaded
 Small time window available when warehouse can be
taken off line (usually nights)
 When to build cubes, index and summary tables
 Allow system administrators to monitor, cancel,
resume, change load rates
 Recover gracefully -- restart after failure from where
you were and without loss of data integrity



Data Warehouse Refreshing
 How do you capture changes at the source?
 triggers – expensive, add overhead
 create incremental extracts (if the data allows it, e.g. sales)
 compare source and DW data
 do a full refresh if that is more efficient
 Refresh requires caution
 e.g. the slowly-changing dimensions problem
 Moving captured data to the warehouse
 specialized, performance-optimized APIs for bulk-loading data
 partitioning the data at the warehouse helps
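The third capture option, comparing source and DW data, can be sketched as a set comparison over keyed snapshots (the snapshot shape, primary key mapped to row value, is a simplification):

```python
def capture_changes(source, warehouse):
    """Diff two snapshots: rows to insert, update, and delete."""
    inserts = {k: v for k, v in source.items() if k not in warehouse}
    updates = {k: v for k, v in source.items()
               if k in warehouse and warehouse[k] != v}
    deletes = set(warehouse) - set(source)
    return inserts, updates, deletes

src = {1: "a", 2: "b-new", 4: "d"}
dw = {1: "a", 2: "b", 3: "c"}
print(capture_changes(src, dw))  # ({4: 'd'}, {2: 'b-new'}, {3})
```

This is the most expensive capture strategy (it touches every row on both sides), which is why incremental extracts or log-based capture are preferred when the source allows them.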



ETL – Tools and Platforms
 SQL Server Integration Services (SSIS)
 Oracle Data Integrator
 Informatica
 Talend
 Pentaho
 Python/R packages for ETL
 Airflow – a platform to author, schedule and monitor
workflows using Python



SSIS – Example



Conclusions: Extract–Transform–Loading
 Extract-Transform-Load (ETL) refers to a collection of
tools that play a crucial role in discovering and
correcting data quality issues and in efficiently loading
large volumes of data into the warehouse.
 ETL is a highly practical topic, involving many ad hoc,
case-specific data manipulation tasks. You do ETL,
you don’t learn ETL.
 A very expensive part of building a DW (often quoted
at 80% of the effort)

