
Knowledge Discovery WS 14/15

Data Warehousing & OLAP 4


Prof. Dr. Rudi Studer, Dr. Achim Rettinger, Dipl.-Inform. Benedikt Kämpgen*
{rudi.studer, achim.rettinger, benedikt.kaempgen}@kit.edu

INSTITUT FÜR ANGEWANDTE INFORMATIK UND FORMALE BESCHREIBUNGSVERFAHREN (AIFB)

KIT – University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
www.kit.edu
Knowledge Discovery Lecture WS14/15

Date        Topic                           Block
22.10.2014  Introduction                    Basics, Overview
29.10.2014  Design of KD-experiments
05.11.2014  Linear Classifiers
12.11.2014  Data Warehousing & OLAP
19.11.2014  Non-Linear Classifiers (ANNs)   Supervised Techniques, Vector+Label Representation
26.11.2014  Kernels, SVM
03.12.2014  (no lecture)
10.12.2014  Decision Trees
17.12.2014  IBL & Clustering                Unsupervised Techniques
07.01.2015  Relational Learning I
14.01.2015  Relational Learning II          Semi-supervised Techniques, Relational Representation
21.01.2015  Relational Learning III
28.01.2015  Text Mining
04.02.2015  Guest lecture                   Meta-Topics
11.02.2015  CRISP, Visualisation

2 12.11.2014 Knowledge Discovery WS 2014/15 - Data Warehousing & OLAP Institut AIFB
Outline

Exploratory KD Techniques
Analytical Queries over Multidimensional Datasets
Data Cube and OLAP
Data structures and Operations for Analytical Queries
Data Warehouses
Implementing the Informational System


The 3 most essential dimensions defining a KD problem:

Data representation
Method (approach, learning algorithm)
Task (problem domain)

Data Representation                         Method                                      Task
Feature vector + Label                      Perceptron                                  Classification
Graph                                       Matrix Factorization                        Recommendation
Feature vector                              K-Means                                     Clustering
Multidimensional Dataset (e.g., Data Cube)  Analytical queries (e.g., OLAP operations)  Data understanding, Data preparation, Trend discovery...

Remember: Feature Vectors

Remember: Classification Example

Training Set

Tid  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1    Yes         Single          125K           No
2    No          Married         100K           No
3    No          Single          70K            No
4    Yes         Married         120K           No
5    No          Divorced        95K            Yes
6    No          Married         60K            No
7    Yes         Divorced        220K           No
8    No          Single          85K            Yes
9    No          Married         75K            No
10   No          Single          90K            Yes

New Data

Tid  Home Owner  Marital Status  Annual Income  Defaulted Borrower
11   Yes         Single          125K           ?
12   No          Married         100K           ?
13   No          Single          70K            ?
14   Yes         Married         120K           ?
15   No          Divorced        95K            ?

[Figure: Training Set -> Learn Classifier -> Model; the Model is applied to New Data]

What is the quality of the learned model?

Multidimensional Datasets

Dimensions: Many independent attributes spanning n-dimensional space (coordinate system)
Measures: Few dependent measurement attributes

[Figure: coordinate system with the axes Geo and Time]

Example
Time  Geo      Sex  ...  Population
2013  Germany  F    ...  41 673 725
2013  Germany  M    ...  40 346 853
2013  Spain    F    ...  23 702 400
2013  Spain    M    ...  23 001 908
...   ...      ...  ...  ...

Eurostat – Population on 1 January by age and sex
http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=demo_pjan&lang=en
Problem: Knowledge Discovery in Multidimensional Datasets

Knowledge Discovery (KD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, 1996]

Challenges to KD in Multidimensional Datasets
distribution of data (Example: sensor data)
heterogeneity of data (Example: different units of measure)
varying quality of data (Example: sales versus social network)
growth and size of data (Example: daily stock-market values)

Solution: Exploratory KD Techniques

Exploratory KD techniques as part of Data Understanding and Data Preparation in the knowledge discovery process

[Figure: KD Process [Fayyad, 1996]]


Exploratory KD Techniques – Defined

Interactive search for interesting patterns:

Information Seeking Mantra: overview first, zoom and filter, then details-on-demand [Shneiderman, 1996]

Extract-Visualize-Analyze Loop of Analysis [Gray, 1995]

Informational Systems – Problem

[Figure: Users (Overview, Filter, Zoom) - Informational System (tools for Exploratory KD over Multidimensional Datasets; pivot tables, visualisations) - Data Sources (Multidimensional Datasets). The building blocks of the informational system are still open ("?").]


Other Approaches

Spreadsheets
RDBMS / Operational Systems
Visualisation Techniques

Spreadsheets

Problem
Manual effort
Trust in data

[Figures: Economist. 2013. This spreadsheet is different; Eurostat wiki]
RDBMS / Operational Systems

Problem
Complementary requirements [Köppen, 2012] [Golfarelli, 2009]
There is no "One Size fits All" [Stonebraker, 07]

                       Operational systems              Informational systems
Data sources           mostly one                       many (e.g., all company data)
Data volume            MB...GB                          GB...TB...PB
Data characteristics   current, detailed, primary data  historical, summarised, derived, integrated
Query types            read, write, update, delete      read, periodical inserts, no updates and deletes
Query characteristics  simple, short, atomic (OLTP)     complex, ad-hoc, aggregating, ordering, filtering (OLAP)

Visualisation Techniques

Problem: Multidimensionality

[Figures: OECD Explorer; Gapminder Visualisation]


Example Exploratory KD User Interface

Overview from different perspectives via Drag & Drop
Filter or Zoom via Menus

Outline

Exploratory KD Techniques
Analytical Queries over Multidimensional Datasets
Data Cube and OLAP
Data structures and Operations for Analytical Queries
Data Warehouses
Implementing the Informational System

Multidimensional Datasets in Relational Databases – Example

Dimensions: Many independent attributes spanning n-dimensional space (coordinate system)
Measures: Few dependent measurement attributes (often aggregated over time or space)

Time  Geo      Sex  ...  Population
2013  Germany  F    ...  41 673 725
2013  Germany  M    ...  40 346 853
2013  Spain    F    ...  23 702 400
2013  Spain    M    ...  23 001 908
...   ...      ...  ...  ...

Eurostat – Population on 1 January by age and sex
http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=demo_pjan&lang=en
How to Reduce Data to One Dimension?
Histograms

Group-by and aggregations over computed categories

Group by
  Group-by partitions a table into groups.
  Example: Age groups

Aggregation function
  An aggregation function summarises attributes from a group, returning a value for each group.
  Example: SUM, COUNT, AVG

[Figure: Group-by illustration [Gray, 1995]]
[Figure: Histogram of population SUM over age groups 0-18, 18-36, 36-54...]
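
The histogram idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows and their numbers are hypothetical, the helper names `age_group` and `group_by_sum` are made up for this sketch, and the bins are treated as half-open intervals (an age of 18 falls into 18-36), which the slide does not specify.

```python
from collections import defaultdict

# Hypothetical rows (age, population); the numbers are made up for illustration.
rows = [(5, 100), (17, 50), (20, 70), (35, 30), (40, 80)]


def age_group(age, width=18):
    """Computed category: map an age to its bin label, e.g. 20 -> '18-36'."""
    low = (age // width) * width
    return f"{low}-{low + width}"


def group_by_sum(rows):
    """Group-by: partition rows by age group; aggregation: SUM per group."""
    totals = defaultdict(int)
    for age, population in rows:
        totals[age_group(age)] += population
    return dict(totals)


histogram = group_by_sum(rows)
for group, total in histogram.items():
    print(group, total)
```

Swapping the `+=` accumulation for a counter or a running mean would give COUNT or AVG instead of SUM.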
How to Successively Reduce Dimensionality?
Subtotals

Successively aggregating data over attributes (roll-up)
Input: Aggregation function
Output: Group by

Time  Geo      Sex  ...  Population by Time/Geo/Sex  Pop by Time/Geo  Pop by Time
2013  Germany  F    ...  41 673 725
2013  Germany  M    ...  40 346 853
...   ...      ...  ...                              82 020 578
2013  Spain    F    ...  23 702 400
2013  Spain    M    ...  23 001 908
...   ...      ...  ...                              46 704 308
...   ...      ...  ...                                               128 724 886

Subtotals of population sum
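
As a rough sketch of how these subtotals arise, the following Python snippet aggregates the four rows shown above over successively fewer attributes. The function name `subtotal` and the tuple layout are assumptions of this sketch, not part of the lecture.

```python
from collections import defaultdict

# Rows from the table above: (time, geo, sex, population)
facts = [
    (2013, "Germany", "F", 41_673_725),
    (2013, "Germany", "M", 40_346_853),
    (2013, "Spain", "F", 23_702_400),
    (2013, "Spain", "M", 23_001_908),
]


def subtotal(facts, n_keys):
    """SUM the population, grouped by the first n_keys attributes of each row."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[:n_keys]] += row[-1]
    return dict(totals)


by_time_geo = subtotal(facts, 2)  # roll-up over Sex
by_time = subtotal(facts, 1)      # roll-up over Sex and Geo
print(by_time_geo)
print(by_time)
```

Each call drops one more attribute from the grouping key, which is exactly the successive roll-up of the three subtotal columns.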


How to Reduce to Arbitrary Dimensions?
Cross Tabulations ("Kreuztabellen"/Pivot Tables)

Symmetric aggregation over attribute combination.

Population by...
Time/Geo/Sex
Time/Geo   Time/Sex   Geo/Sex
Time   Geo   Sex
()

Geo\Sex  F           M           total
Germany  41 673 725  40 346 853  82 020 578
Spain    23 702 400  23 001 908  46 704 308
total    65 376 125  63 348 761  128 724 886

Pivot table: Population 2013 (further pivot tables for 2012, ...)
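
A cross tabulation like the one above could be computed as follows. This is a minimal sketch: `pivot_geo_by_sex` is a hypothetical helper, and the nested-dictionary layout is just one convenient way to hold rows and columns.

```python
from collections import defaultdict

# Rows from the slides: (time, geo, sex, population)
facts = [
    (2013, "Germany", "F", 41_673_725),
    (2013, "Germany", "M", 40_346_853),
    (2013, "Spain", "F", 23_702_400),
    (2013, "Spain", "M", 23_001_908),
]


def pivot_geo_by_sex(facts):
    """Cross tabulation Geo x Sex with SUM, including a 'total' row and column."""
    table = defaultdict(lambda: defaultdict(int))
    for _time, geo, sex, population in facts:
        # Every fact contributes to its own cell and to the matching totals.
        for row in (geo, "total"):
            for column in (sex, "total"):
                table[row][column] += population
    return {row: dict(columns) for row, columns in table.items()}


table = pivot_geo_by_sex(facts)
for row, columns in table.items():
    print(row, columns)
```

The symmetry of the aggregation shows in the two nested loops: rows and columns are treated in exactly the same way.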
How Difficult is it to Compute All Dimension Reductions? – CUBE Operator [Gray, 95]

Input
  Multidimensional dataset in relational database
  Aggregation function (SUM, AVG, COUNT...)

CUBE Operator

Output
  All possible cross tabulations for multidimensional dataset
  Stored in relational database
CUBE Operator – Example

Each block of rows below is the result of one group-by.

Time  Geo      Sex  Population
2013  Germany  F    41 673 725
2013  Germany  M    40 346 853
2013  Spain    F    23 702 400
2013  Spain    M    23 001 908
...   ...      ...  ...
2013  Germany  ALL  82 020 578
...   ...      ...  ...
...   ALL      ...  ...
...   ...      ...  ...
ALL   ...      ...  ...
...   ...      ...  ...
...   ALL      ALL  ...
...   ...      ...  ...
ALL   ALL      ALL  ...


CUBE Operator – Number of Group-bys

Number of group-bys: 2^|N|, where N is the set of dimensions.
In the example with Time, Geo and Sex: 2^3 = 8 group-bys, from (Time, Geo, Sex) down to (ALL, ALL, ALL).
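
The CUBE operator can be illustrated with a small Python sketch that enumerates every subset of the dimensions and runs one group-by per subset. This is not how a database implements CUBE; it is a naive version meant only to show where the 2^|N| group-bys come from. The function name `cube` and the `"ALL"` placeholder string are assumptions of this sketch.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("time", "geo", "sex")

# Rows from the slides: (time, geo, sex, population)
facts = [
    (2013, "Germany", "F", 41_673_725),
    (2013, "Germany", "M", 40_346_853),
    (2013, "Spain", "F", 23_702_400),
    (2013, "Spain", "M", 23_001_908),
]


def cube(facts, n_dims):
    """Naive CUBE with SUM: one group-by per subset of the dimensions.

    Dimensions that are aggregated away are marked with the value 'ALL'.
    """
    result = defaultdict(int)
    for size in range(n_dims + 1):
        for kept in combinations(range(n_dims), size):
            for row in facts:
                key = tuple(row[i] if i in kept else "ALL" for i in range(n_dims))
                result[key] += row[-1]
    return dict(result)


cells = cube(facts, len(DIMENSIONS))
n_group_bys = 2 ** len(DIMENSIONS)
print(n_group_bys, "group-bys,", len(cells), "result rows")
```

For these four rows the eight group-bys produce 18 result rows, since Time has one value and Geo and Sex have two each, plus ALL in every dimension.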

How to Analyse Multidimensional Datasets?
Representing and Querying

Conceptual Level: Independent from representation
Logical Level: Dependent on representation
Physical Level: Dependent on setting and actual data

Three-level-architecture [Ciferri, 2012]


Data Cube

Natural and widely-adopted way to think about multidimensional datasets [Chaudhuri, 97]
Intuitive analysis operations (OLAP)

[Figure: elements of a data cube: Fact, Dimension, Hierarchy, Level, Member]
Data Cube – Example

[Figure: Population Data Cube. Geo dimension with the levels NUTS0, NUTS1, NUTS2 and the members DE, DE1, DE2, DE11, DE12, DE212, ...; Time dimension with the levels Year, Month, Day and the members 2013, January, 1, ...; example cell value 20 000]
Data Cube Elements: Member

Member = Set of all members
V ∊ 2^Member
m = (name) ∊ Member
rollupmember: V -> V
Meaning: "member more specific than"

Example
Member = {DE, ES, DE1, 2010-01-01, 2010...}
DE = ("DE")

[Figure: member tree of the Time dimension: ALL > 2013 > January 2013 > 2013-01-01, 2013-01-02]
Data Cube Elements: Level

Level = Set of all levels
L ∊ 2^Level
l = (name, V, depth) ∊ Level
rolluplevel: L -> L
Meaning: "more specific level than"

Example
Level = {NUTS2, NUTS1, NUTS0, Day, Month, Year}
NUTS0 = ("NUTS0", {DE, ES, ...}, 1)
rolluplevel = {Day -> Month, Day -> Week, Month -> Year, Year -> ALL}

[Figure: level lists ALL > NUTS0 > NUTS1 > NUTS2 and ALL > Year > Month > Day]

Data Cube Elements: Hierarchy

Hierarchy = Set of all hierarchies
H ∊ 2^Hierarchy
h = (name, L, rolluplevel, rollupmember) ∊ Hierarchy

A hierarchy has a most granular and an ALL level
The levels form an ordered list
The members form a tree

Example
Hierarchy = {geoH, timeH, sexH}
timeH = ("timeH", {Day, Month, Year, ALL},
         {Day -> Month, Month -> Year, Year -> ALL},
         {2013-01-01 -> 2013-01, 2013-01 -> 2013, 2013 -> ALL})

[Figure: levels ALL > Year > Month > Day (Day = ⊥, the most granular level) next to the members ALL > 2013 > January 2013 > 2013-01-01, 2013-01-02]
Data Cube Elements: Dimension

Dimension = Set of all dimensions
D ∊ 2^Dimension
d = (name, H) ∊ Dimension

Example
Dimension = {geoD, timeD, sexD, ...}
timeD = ("timeD", {timeH})

Data Cube Elements: Measure

Measure = Set of all measures
M ∊ 2^Measure
m = (name, aggfunc) ∊ Measure
aggfunc ∊ {SUM, AVG, COUNT...}

Example
Measure = {populationMeasSum, ...}
populationMeasSum = ("populationMeasSum", SUM)

Data Cube Elements: Fact

Fact = Set of all facts
F ∊ 2^Fact
fact = (name, C, E) ∊ Fact
Fact = String x 2^(Dimension x Member) x 2^(Measure x Number)

Example
Fact = {fact1, fact2, fact3, ...}
fact1 = ("fact1", {(geoD, DE), (timeD, 2013), (sexD, F)}, {(populationMeasSum, "41,673,725")...})
fact2 = ("fact2", {(geoD, DE), (timeD, 2013), (sexD, M)}, {(populationMeasSum, "40,346,853")...})
Data Cube Elements: Data Cube Schema

DataCubeSchema = Set of all data cube schemas
CS ∊ 2^DataCubeSchema
cs = (name, D, M) ∊ DataCubeSchema
DataCubeSchema = String x 2^Dimension x 2^Measure

Example
DataCubeSchema = {populationCS, gdpCS}
populationCS = ("populationCS", {geoD, timeD, sexD}, {populationMeasSum,...})

Data Cube Elements: Data Cube

DataCube = Set of all data cubes
cube = (name, cs, F) ∊ DataCube
DataCube = String x DataCubeSchema x 2^Fact

Every fact ∊ F has a value for each of the dimensions and measures
No two facts may have the same dimension values
Every fact only describes members on most granular (bottom) levels

Example
DataCube = {populationC, gdpC}
populationC = ("populationC", populationCS, {fact1, fact2, fact3...})
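
The tuple definitions above translate fairly directly into data structures. The sketch below is one possible encoding, not the lecture's own: the class and field names are assumptions, and `check_cube` only covers the first two constraints (a value for every dimension and measure, no duplicate dimension values). The third constraint would additionally need the hierarchy information. The third fact and its value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    name: str
    coordinates: frozenset  # C: set of (dimension, member) pairs
    values: frozenset       # E: set of (measure, number) pairs


@dataclass(frozen=True)
class DataCubeSchema:
    name: str
    dimensions: frozenset   # D
    measures: frozenset     # M


def check_cube(schema, facts):
    """Return a list of violated data cube constraints (empty if none)."""
    problems = []
    seen = set()
    for fact in facts:
        dims = {dimension for dimension, _member in fact.coordinates}
        meas = {measure for measure, _number in fact.values}
        if dims != schema.dimensions or meas != schema.measures:
            problems.append(f"{fact.name}: missing or extra dimension/measure")
        if fact.coordinates in seen:
            problems.append(f"{fact.name}: duplicate dimension values")
        seen.add(fact.coordinates)
    return problems


population_cs = DataCubeSchema(
    "populationCS",
    frozenset({"geoD", "timeD", "sexD"}),
    frozenset({"populationMeasSum"}),
)
fact1 = Fact(
    "fact1",
    frozenset({("geoD", "DE"), ("timeD", "2013"), ("sexD", "F")}),
    frozenset({("populationMeasSum", 41_673_725)}),
)
fact2 = Fact(
    "fact2",
    frozenset({("geoD", "DE"), ("timeD", "2013"), ("sexD", "M")}),
    frozenset({("populationMeasSum", 40_346_853)}),
)
# Hypothetical fact that reuses the coordinates of fact1.
fact3 = Fact("fact3", fact1.coordinates, frozenset({("populationMeasSum", 1)}))

print(check_cube(population_cs, [fact1, fact2]))
print(check_cube(population_cs, [fact1, fact3]))
```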

How to Analyse Multidimensional Datasets?
Representing and Querying (2)

Conceptual Level: Independent from representation
Logical Level: Dependent on representation
Physical Level: Dependent on setting and actual data

Three-level-architecture [Ciferri, 2012]

Representing Data Cubes

Relational Database: Star Schema, Snowflake Schema...


Multidimensional Database: Array
Triple Store: The RDF Data Cube Vocabulary
...

Star Schema (“Sternschema”)

Advantages
Intuitive transformation from data cube to star schema.
Easy to implement in widely-used relational databases.
Fast queries since partly denormalised and few required
joins.
Extensions and changes to the schema easy to realise.
Elements of star schema
Fact tables (large)
Dimension tables (small)

How to Create a Star Schema of a Data Cube?

For every data cube, one fact table
For every fact, one row in fact table
For every measure, a (numeric) attribute in fact table
For every dimension, a dimension table with a primary key, and a foreign key in the fact table (all foreign keys together form the primary key of the fact table)
For every level, an attribute in dimension table
For every member with highest granularity, a row in a dimension table

Example
geoD (dimension table for Geo): geoID, NUTS2, NUTS1, NUTS0
timeD (dimension table for Time): timeID, Day, Week, Month, Year
sexD (dimension table for Sex): sexID, sex
populationC (fact table for Population Data Cube): geoID, timeID, sexID, population

Each dimension table has a 1:* relationship to the fact table.
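
The rules above can be tried out with Python's built-in `sqlite3` module. The following is a sketch under assumptions: the column types, the sample rows and all population numbers are hypothetical, and SQLite only enforces the `REFERENCES` clauses if `PRAGMA foreign_keys = ON` is set, which this sketch does not rely on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geoD  (geoID INTEGER PRIMARY KEY, nuts2 TEXT, nuts1 TEXT, nuts0 TEXT);
CREATE TABLE timeD (timeID INTEGER PRIMARY KEY, day TEXT, week TEXT, month TEXT, year INTEGER);
CREATE TABLE sexD  (sexID INTEGER PRIMARY KEY, sex TEXT);
-- Fact table: one foreign key per dimension, together forming the primary key.
CREATE TABLE populationC (
    geoID      INTEGER REFERENCES geoD(geoID),
    timeID     INTEGER REFERENCES timeD(timeID),
    sexID      INTEGER REFERENCES sexD(sexID),
    population INTEGER,
    PRIMARY KEY (geoID, timeID, sexID)
);
""")

# Hypothetical members and facts; the population numbers are made up.
conn.executemany("INSERT INTO geoD VALUES (?, ?, ?, ?)",
                 [(1, "DE11", "DE1", "DE"), (2, "DE12", "DE1", "DE")])
conn.execute("INSERT INTO timeD VALUES (1, '2013-01-01', '2013-W01', '2013-01', 2013)")
conn.executemany("INSERT INTO sexD VALUES (?, ?)", [(1, "F"), (2, "M")])
conn.executemany("INSERT INTO populationC VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 100), (1, 1, 2, 90), (2, 1, 1, 60), (2, 1, 2, 50)])

# Because every level is an attribute, aggregating to NUTS1 needs a single join.
rows = conn.execute("""
    SELECT nuts1, SUM(population)
    FROM populationC JOIN geoD ON populationC.geoID = geoD.geoID
    GROUP BY nuts1
""").fetchall()
print(rows)


def duplicate_fact_rejected(conn):
    """Try to insert a second fact with the same dimension values."""
    try:
        conn.execute("INSERT INTO populationC VALUES (1, 1, 1, 999)")
    except sqlite3.IntegrityError:
        return True
    return False


rejected = duplicate_fact_rejected(conn)
print(rejected)
```

The composite primary key is what enforces the data cube constraint that no two facts share the same dimension values.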

Challenges with Representing Data Cubes

Conformed dimensions
Complex hierarchies (e.g., non-strict)
Aggregation functions (e.g., meaningless: "Sum of population over time")
Other metadata (e.g., human-readable labels)
Pre-aggregated values (e.g., often-used summarisations)

Conformed Dimensions

Integrating several data cubes
Conformed dimensions [Kimball, 2002]
Total/Partial sharing of hierarchies

Example
[Figure: Population Data Cube (Geo with level NUTS0 and members DE, ES, ...; Time with Year 2013, Month January, Day 1, ...) and GDP Data Cube (Geo with levels NUTS0, NUTS1, NUTS2 and members DE, DE1, DE2, DE11, DE12, DE212, ...; Time with Year 2013, Month January, Day 1, ...). Labels: Equivalent Dimension, Partially shared hierarchy]



Informational Systems – Problem

[Figure: Users (Overview, Filter, Zoom) - Informational System (tools for Exploratory KD over Multidimensional Datasets; pivot tables, visualisations) - Data Sources (Multidimensional Datasets). The Data Cube is now filled in as one building block of the informational system; the remaining building blocks are still open ("?").]


Outline

Exploratory KD Techniques
Analytical Queries over Multidimensional Datasets
Data Cube and OLAP
Data structures and Operations for Analytical Queries
Data Warehouses
Implementing the Informational System

How to Analyse Multidimensional Datasets?
Representing and Querying (3)

Conceptual Level: Independent from representation
Logical Level: Dependent on representation
Physical Level: Dependent on setting and actual data

Three-level-architecture [Ciferri, 2012]

OLAP Operations

Widely-adopted analysis operations over Data Cubes [Gómez, 2012]

Exploratory KD Technique
  Overview (Slice, Roll-Up)
  Filter/Zoom (Projection, Dice)

Advantages
  No need to learn complicated query language
  Query results can be displayed in pivot tables and further analysed.
  Algebra: nested set of operations possible, e.g., Dice(Projection(...))
    Input: Data Cube
    Output: Data Cube

[Figure: Dice and Projection applied to a data cube; labels: de, pop, SUM]

Example OLAP User Interface

Overview from different perspectives via Drag & Drop (Slice, Roll-Up)
Filter or Zoom via Menus (Projection, Dice)

Example Star Schema for Data Cube

Example
Time (Year)  Geo (NUTS2)  gdpMeasSum  gdpMeasAvg  ...
2010         DE12         88 576      ...         ...
2010         DE11         140 737     ...         ...
2010         DE21         174 444     ...         ...
...          ...          ...         ...         ...

We assume star schema representation

Regional gross domestic product (million PPS) by NUTS 2 regions
http://epp.eurostat.ec.europa.eu/tgm/table.do?tab=table&init=1&language=en&pcode=tgs00004

Projection() – Filter Measures

Projection: DataCube x 2^Measure -> DataCube with
Projection(c, PM) = c'
c'.DataCubeSchema.M = PM, with PM ⊆ c.DataCubeSchema.M

Example
Projection(gdpC, {gdpMeasSum})

Time  Geo   gdpMeasSum
2010  DE12  88 576
2010  DE11  140 737
2010  DE21  174 444
...   ...   ...

Dice() – Filter Dimensions

Dice: DataCube x Dimension x 2^Member -> DataCube with
Dice(c, dd, DM) = c'

Example
Dice(gdpC, geoD, {DE21, DE11})

SELECT year, nuts2, gdpM
FROM gdpC, timeD, geoD
WHERE gdpC.timeID = timeD.timeID
  AND gdpC.geoID = geoD.geoID
  AND (nuts2 = 'DE21' OR nuts2 = 'DE11')
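
The Dice query can be run against a tiny in-memory star schema with Python's `sqlite3` module. The table layout below is a simplified assumption (only the columns the query needs), filled with the three GDP values from the example slide. The sketch also shows why the member filter needs its own parentheses: since AND binds more tightly than OR in SQL, the unparenthesised variant drops the join conditions for DE11 and returns extra rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE timeD (timeID INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE geoD  (geoID INTEGER PRIMARY KEY, nuts2 TEXT);
CREATE TABLE gdpC  (timeID INTEGER, geoID INTEGER, gdpM INTEGER);
INSERT INTO timeD VALUES (1, 2010);
INSERT INTO geoD  VALUES (1, 'DE12'), (2, 'DE11'), (3, 'DE21');
INSERT INTO gdpC  VALUES (1, 1, 88576), (1, 2, 140737), (1, 3, 174444);
""")

SELECT = "SELECT year, nuts2, gdpM FROM gdpC, timeD, geoD WHERE "
JOIN = "gdpC.timeID = timeD.timeID AND gdpC.geoID = geoD.geoID"

# Dice(gdpC, geoD, {DE21, DE11}) with the member filter in parentheses.
diced = conn.execute(
    SELECT + JOIN + " AND (nuts2 = 'DE21' OR nuts2 = 'DE11') ORDER BY nuts2"
).fetchall()

# Without parentheses the OR branch bypasses the join conditions.
unparenthesised = conn.execute(
    SELECT + JOIN + " AND nuts2 = 'DE21' OR nuts2 = 'DE11'"
).fetchall()

print(diced)
print(len(unparenthesised))
```

An `IN ('DE21', 'DE11')` predicate would be an equivalent way to avoid the precedence issue.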

Slice() – Remove Dimensions

Slice: DataCube x 2^Dimension -> DataCube with
Slice(c, SD) = c'
c'.DataCubeSchema.D = c.DataCubeSchema.D\SD

Example
Slice(gdpC, {geoD})

SELECT year, SUM(gdpM)
FROM gdpC, timeD
WHERE gdpC.timeID = timeD.timeID
GROUP BY year
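
Independent of the star schema, Slice can also be sketched directly on the conceptual model. In the following Python snippet a fact is a pair of a coordinate dictionary and a measure value; this representation and the function name `slice_cube` are assumptions of the sketch, and the aggregation function is fixed to SUM.

```python
from collections import defaultdict

# Facts of gdpC from the example slide: ({dimension: member}, gdpMeasSum)
gdp_facts = [
    ({"timeD": 2010, "geoD": "DE12"}, 88_576),
    ({"timeD": 2010, "geoD": "DE11"}, 140_737),
    ({"timeD": 2010, "geoD": "DE21"}, 174_444),
]


def slice_cube(facts, removed_dimensions):
    """Remove the given dimensions and SUM the measure over their members."""
    totals = defaultdict(int)
    for coords, value in facts:
        remaining = tuple(sorted(
            (dim, member) for dim, member in coords.items()
            if dim not in removed_dimensions
        ))
        totals[remaining] += value
    return [(dict(key), total) for key, total in totals.items()]


sliced = slice_cube(gdp_facts, {"geoD"})
print(sliced)
```

Facts that only differed in the removed dimension collapse into one fact, which corresponds to the GROUP BY over the remaining dimensions in the SQL version.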

Roll-Up() – Aggregate to Higher Level

Roll-Up: DataCube x Dimension x Level -> DataCube with
Roll-Up(c, rd, rl) = c'

Example
Roll-Up(gdpC, geoD, NUTS1)

SELECT year, nuts1, SUM(gdpM)
FROM gdpC, timeD, geoD
WHERE gdpC.timeID = timeD.timeID
  AND gdpC.geoID = geoD.geoID
GROUP BY year, nuts1
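
On the conceptual level, Roll-Up replaces each member by its parent according to rollupmember and then aggregates. The Python sketch below assumes SUM as aggregation function and a single roll-up step; the dictionary `NUTS2_TO_NUTS1` plays the role of rollupmember and follows the NUTS code prefixes, and the function name `roll_up` is made up for this sketch.

```python
from collections import defaultdict

# Facts of gdpC from the example slide: ({dimension: member}, gdpMeasSum)
gdp_facts = [
    ({"timeD": 2010, "geoD": "DE12"}, 88_576),
    ({"timeD": 2010, "geoD": "DE11"}, 140_737),
    ({"timeD": 2010, "geoD": "DE21"}, 174_444),
]

# rollupmember restricted to the Geo members used here (NUTS2 -> NUTS1).
NUTS2_TO_NUTS1 = {"DE11": "DE1", "DE12": "DE1", "DE21": "DE2"}


def roll_up(facts, dimension, rollup_member):
    """Lift the members of one dimension to their parents and SUM the measure."""
    totals = defaultdict(int)
    for coords, value in facts:
        lifted = dict(coords)
        lifted[dimension] = rollup_member[coords[dimension]]
        totals[tuple(sorted(lifted.items()))] += value
    return dict(totals)


rolled = roll_up(gdp_facts, "geoD", NUTS2_TO_NUTS1)
for key, total in rolled.items():
    print(dict(key), total)
```

Rolling up over several levels would apply the member mapping repeatedly until the target level is reached.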

Drill-Across() – Integrate Two Data Cubes

Drill-Across: DataCube x DataCube -> DataCube with
Drill-Across(c1, c2) = c'
"Drill-across over conformed dimensions and (total/partial) shared hierarchies"
c'.DataCubeSchema.D = c1.DataCubeSchema.D ∪ c2.DataCubeSchema.D
c'.DataCubeSchema.M = c1.DataCubeSchema.M ∪ c2.DataCubeSchema.M

Example
Drill-Across(gdpC, populationC)

SELECT year, nuts0, sex, SUM(gdpM), SUM(populationM)
FROM gdpC, populationC, timeD, geoD, sexD
WHERE gdpC.timeID = timeD.timeID
  AND gdpC.geoID = geoD.geoID
  AND populationC.timeID = timeD.timeID
  AND populationC.geoID = geoD.geoID
  AND populationC.sexID = sexD.sexID
GROUP BY year, nuts0, sex

Problems: Partially shared hierarchies. Non-conformed dimensions.
Drill-Across() – Integrate Two Data Cubes

Problems: Partially shared hierarchies. Non-conformed dimensions.

Solution: Nested operations

Drill-Across(
  Roll-Up(gdpC, geoD, nuts0),
  Slice(populationC, {sexD})
)
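
The nested expression can be sketched in Python as follows. All facts and numbers here are hypothetical and chosen so that both cubes cover the same year; the helper names `aggregate` and `drill_across` are assumptions, SUM is used throughout, and the GDP cube is given at NUTS1 level to keep the example short.

```python
from collections import defaultdict

# Hypothetical facts; all numbers are made up for illustration.
gdp_facts = [((2013, "DE1"), 5), ((2013, "DE2"), 7)]       # (year, NUTS1) -> GDP
population_facts = [((2013, "DE", "F"), 4),                 # (year, NUTS0, sex)
                    ((2013, "DE", "M"), 3)]
NUTS1_TO_NUTS0 = {"DE1": "DE", "DE2": "DE"}


def aggregate(facts, key_of):
    """SUM the measure of all facts that map to the same target coordinates."""
    totals = defaultdict(int)
    for coords, value in facts:
        totals[key_of(coords)] += value
    return dict(totals)


# Roll-Up(gdpC, geoD, nuts0): lift the Geo member to NUTS0.
gdp_nuts0 = aggregate(gdp_facts, lambda c: (c[0], NUTS1_TO_NUTS0[c[1]]))
# Slice(populationC, {sexD}): drop the Sex coordinate.
population_no_sex = aggregate(population_facts, lambda c: (c[0], c[1]))


def drill_across(cube1, cube2):
    """Combine the measures of two cubes on identical coordinates."""
    shared = sorted(cube1.keys() & cube2.keys())
    return {key: (cube1[key], cube2[key]) for key in shared}


combined = drill_across(gdp_nuts0, population_no_sex)
print(combined)
```

After the two inner operations both cubes have the same dimensions at the same levels, so the join no longer multiplies rows or mixes granularities.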
How to Analyse Multidimensional Datasets?
Representing and Querying (4)

Conceptual Level: Independent from representation
Logical Level: Dependent on representation
Physical Level: Dependent on setting and actual data

Three-level-architecture [Ciferri, 2012]

How to provide such functionality in an informational system?
Data Warehouses


Informational Systems – Problem

[Figure: Users (Overview, Filter, Zoom) - Informational System (tools for Exploratory KD over Multidimensional Datasets; pivot tables, visualisations) - Data Sources (Multidimensional Datasets). Data Cube and OLAP Operations are now filled in as building blocks of the informational system; one building block is still open ("?").]




Outline

Exploratory KD Techniques
Analytical Queries over Multidimensional Datasets
Data Cube and OLAP
Data structures and Operations for Analytical Queries
Data Warehouses
Implementing the Informational System

Data Warehouse – Purpose and Requirements

A Data Warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of management's decision-making process. [Inmon, 1996]

FASMI requirements for Data Warehouses [Pendse, 95]
Fast: Typically returns results within 5 seconds for interactive analysis.
Analysis: Allows useful ad-hoc analytical queries. Most often: OLAP.
Shared: Several users with different rights.
Multidimensional: Adequate representation of multidimensional datasets. Most often: Data Cube.
Information: Integration of all relevant metadata and data.

Data Warehouse
Data Flow [Köppen, 2012]

[Figure: Data sources -> Extraction -> Work space (with Transformation) -> Load -> Base database -> Load -> Data Cube -> Analysis. Further labels: ETL, Data Warehouse. Legend: data flow, control flow]

Data Warehousing vs. Federated Data

Pro Data Warehousing
  Performance
  Reliability / Robustness

Pro Federated Data
  Flexible to ad-hoc query new data source.
  Data freshness

[Figure labels: Analysis, DW]

http://semanticweb.com/defending_the_warehouse_b17223
Data Warehousing vs. Federated Data

[Figure: the same comparison as above, with the labels Analysis and SAP HANA]

http://semanticweb.com/defending_the_warehouse_b17223
Data Warehouse
Data Flow [Köppen, 2012]

Data sources
Task: Supply data for data warehouse
Not part of the data warehouse
Internal or external to the company
Heterogeneous wrt. structure, content, interfaces
Factors for choice of sources
  Purpose of data warehouse
  Quality of source data
  Availability (legally, socially, technically)
  Costs of data acquisition (especially for external sources)

Data Warehouse
Data Flow [Köppen, 2012]

Extraction component (Wrappers)
Task: Access sources, extract data, prepare data for usage
Present all data in a specific format
  E.g., tables, XML
Operating mode depends on monitoring strategy
  Periodically
  On demand
  Event-driven (e.g. when a pre-defined number of changes has occurred)
  Directly when a change occurs
Realisation
  Usage of standard interfaces (e.g. ODBC)
  Exception handling for continuation in case of error
Data Warehouse
Data Flow [Köppen, 2012]

Transformation component (Mediators)
Task: Preparation and adaptation of data for loading
Transform all data to a consistent schema (Data Integration)
  Align data types, dates, units of measure, encodings...
  Identify equivalent entities
  Apply conversions etc.
Elimination of impurity (Data Cleaning)
  Incorrect or missing values, redundancy, out-dated values
  Use domain knowledge (e.g. Business Rules) for finding impurities, redundancies, ...
  Data Mining e.g. for deviations
Data Warehouse
Data Flow [Köppen, 2012]

Load component
Task: Transfer of cleaned and preprocessed (e.g. aggregated) data into base database/data cube
Characteristics
  Usage of special load tools (e.g. bulk loader of database)
  Changes must not overwrite existing data warehouse data; instead, new data has to be stored in addition (history of data)
Load process:
  Online: Base database or data cube remains available during loading
  Offline: Database not available (time frame: at night, on weekends)
Data Warehouse
Data Flow [Köppen, 2012]

Work space
Task: Central acquisition component in so-called staging area
Temporary storage for integration
Execution of transformations (e.g., data cleaning, integration) directly in temporary storage
Load transformed data into base database/data cube only after successful completion of transformations
Advantages: sources and data cube are not affected, and no erroneous data is transferred

Data Warehouse
Data Flow [Köppen, 2012]

Base database
Task: Supply data cube with cleaned data
Non-temporary storage for pre-processed data
Independent of concrete analyses, i.e. no aggregations
Often omitted in real applications, since its information is redundant to the data cube

Data Warehouse
Data Flow [Köppen, 2012]

Data Cube (Data Mart)
Task: Database for analysis purposes; structure depends on analysis requirements
Conceptual – Logical – Physical Layer
Most often: RDBMS (ROLAP), Array (MOLAP)
FASMI requirements
Interface for analysis tools, e.g. via OLE DB for OLAP, XMLA, MDX.

78 12.11.2014 Knowledge Discovery WS 2014/15 - Data Warehousing & OLAP Institut AIFB
Data Warehouse
Data Flow [Köppen, 2012]
Analysis tools
Task: GUI for Exploratory KD
Mapping from user interactions to analytical queries
Drag & Drop, Menus...
Presentation of results:
Pivot tables, Reports, Dashboards, Visualisations etc.
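
How a tool might map a user interaction to an analytical query can be sketched with a relational backend. This is an assumption-laden illustration: the table `facts`, its columns and the function `build_query` are hypothetical, and SQL over SQLite stands in for whatever query interface (e.g. MDX) a real tool would use.

```python
import sqlite3

def build_query(rows, measure, table="facts"):
    """Translate a hypothetical drag-and-drop selection into a GROUP BY query.

    The names come from the tool's own metadata, not from free user input,
    so they are interpolated directly in this sketch.
    """
    dims = ", ".join(rows)
    return (
        f"SELECT {dims}, SUM({measure}) FROM {table} "
        f"GROUP BY {dims} ORDER BY {dims}"
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (year INTEGER, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [(2013, "North", 10.0), (2013, "South", 20.0), (2014, "North", 30.0)],
)

# The user drags "year" onto the rows area and picks "revenue" as the measure.
query = build_query(rows=["year"], measure="revenue")
result = conn.execute(query).fetchall()
```

The rows of `result` are what a pivot table would then display, one line per year.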

Data Warehouse – Reference Architecture

[Figure: reference architecture of a data warehouse. Components: Data sources, Extraction, Work space, Transformation, Load, Base database, Load, Data Cube, Analysis, Monitors, Data Warehouse Manager, Metadata Manager, Repository. Arrows distinguish data flow from control flow.]


Data Warehouse – Reference Architecture –
Data Warehouse Manager



Data Warehouse – Reference Architecture –
Data Warehouse Manager (1)

Data Warehouse Manager
Task: Initiates, controls and monitors the various processes
(workflow control; "Ablaufsteuerung")
Initiates load process
Coordinates processing order
Monitors further steps
Ensures error documentation and recovery
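
The manager's tasks can be sketched as a small process controller. This is a minimal illustration; `run_pipeline` and the three step functions are hypothetical, and the deliberately failing `transform` step only serves to show error documentation.

```python
def run_pipeline(steps):
    """Run named steps in order; document the first error and stop."""
    log = []
    for name, step in steps:
        try:
            step()
            log.append((name, "ok"))
        except Exception as exc:
            log.append((name, f"failed: {exc}"))
            break  # later steps are skipped so recovery can start here
    return log

def extract():
    pass  # placeholder for the extraction step

def transform():
    raise ValueError("bad date format")  # simulated transformation error

def load():
    pass  # placeholder for the load step

log = run_pipeline([("extract", extract), ("transform", transform), ("load", load)])
```

The resulting `log` records which step failed and why, and shows that the load step was never started.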
Data Warehouse – Reference Architecture –
Monitors



Data Warehouse – Reference Architecture –
Monitors
Monitors
Task: Discover changes in a data source and ensure their
eventual propagation to the data cube
Cooperative data sources
Trigger-based: Source copies changed data to a separate region.
Replication-based: Reuses the data flow of the source's internal replication process.
Non-cooperative data sources
Log-based: Use transaction log of DBMS
Timestamp-based: Detect changed tuples by comparing their timestamps
Snapshot-based: Comparison of periodical database copies
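
Snapshot-based monitoring can be sketched as a comparison of two copies keyed by primary key. This is a simplified illustration; `diff_snapshots` and the sample data are hypothetical, and real snapshots would be database copies rather than dictionaries.

```python
def diff_snapshots(old, new):
    """Compare two snapshots (dicts keyed by primary key) and report changes."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserted, deleted, updated

# Two periodic copies of the same hypothetical source table.
yesterday = {1: "Alice", 2: "Bob", 3: "Carol"}
today = {1: "Alice", 2: "Robert", 4: "Dave"}

inserted, deleted, updated = diff_snapshots(yesterday, today)
```

The approach needs no cooperation from the source, at the cost of storing and comparing full copies.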


Data Warehouse – Reference Architecture –
Metadata Manager and Repository



Data Warehouse – Reference Architecture –
Metadata Manager and Repository
Repository
Task: Administer and store information that
simplifies construction, maintenance and usage of the
data warehouse system (metadata)
Requires flexible schema (e.g., using XML)

Metadata
Database schemas
Data Cube Metadata queries
getCubes()
getDimensions()
getMeasures()
...
ETL background information
Versioning information
Administration settings
Access policies
Usage statistics
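
The data cube metadata queries named above can be sketched with a small in-memory repository. This is a hypothetical illustration: the class `MetadataRepository`, its `register_cube` method and the snake_case method names are assumptions, not the interface of an actual system.

```python
class MetadataRepository:
    """Hypothetical in-memory repository for data cube metadata."""

    def __init__(self):
        self._cubes = {}

    def register_cube(self, name, dimensions, measures):
        """Store the schema of one cube."""
        self._cubes[name] = {
            "dimensions": list(dimensions),
            "measures": list(measures),
        }

    def get_cubes(self):
        """Names of all known cubes, sorted alphabetically."""
        return sorted(self._cubes)

    def get_dimensions(self, cube):
        """Dimensions of the given cube."""
        return self._cubes[cube]["dimensions"]

    def get_measures(self, cube):
        """Measures of the given cube."""
        return self._cubes[cube]["measures"]

repo = MetadataRepository()
repo.register_cube("Sales", dimensions=["Time", "Region"], measures=["Revenue"])
```

An analysis tool could call `repo.get_cubes()` first and then ask for the dimensions and measures of the selected cube to build its menus.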
Informational Systems – Problem

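[Figure: Users (Overview, Filter, Zoom; Pivot tables, Visualisations) work with the Informational System (OLAP Operations; Tools for Exploratory KD over Multidimensional Datasets; Data Cube), which draws on the Data Sources (Multidimensional Datasets). A "?" marks the part of the informational system that is still open.]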


Informational Systems – Problem

[Figure: Users (Overview, Filter, Zoom; Pivot tables, Visualisations) work with the Informational System (OLAP Operations; Tools for Exploratory KD over Multidimensional Datasets; Data Cube; Data Warehouses & ETL), which draws on the Data Sources (Multidimensional Datasets).]


Case Study: Can we confirm or oppose
Okun's law?

[Figure: Real GDP Growth Rate, Employment Growth]
http://km.aifb.kit.edu/projects/ldcx/ (Kämpgen et al. 2014)
Case Study: Can we confirm or oppose
Okun's law? (2)

Case Study: Can we confirm or oppose
Okun's law? (3)

Pearson-Correlation: 0.851548866822179
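
For reference, a Pearson correlation such as the one reported above can be computed as follows. The growth rates below are made up for illustration and are NOT the Eurostat figures used in the case study; `pearson` is a hypothetical helper written out by hand.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up growth rates in percent (not the data behind the slide).
gdp_growth = [1.0, 2.0, 3.0, 4.0]
employment_growth = [0.4, 1.1, 1.4, 2.1]

r = pearson(gdp_growth, employment_growth)
```

A value of `r` close to 1 indicates that the two series move together, which is the kind of relationship Okun's law describes.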

Case Study: Financial Information Observation
System (FIOS) – Components
[Figure: FIOS components placed on the data warehouse reference architecture: Eurostat Database, Wrappers (Estatwrap), Directed Crawler (OLAP4LD), Triple Store (embedded Sesame), OLAP engine (OLAP4LD), xmla4js OLAP client]
Informational Systems – Problem

[Figure: informational system overview annotated with the concepts and tools of the lecture. Users (Overview, Filter, Zoom; Pivot tables, Visualisations): Eurostat-Explorer, xmla4js. OLAP Operations: Projection, Dice, Slice, Roll-Up, Drill-Across. Data Cube: Histograms, Subtotals, Crosstabs, CUBE operator. Data Warehouses & ETL: Estatwrap, OLAP4LD. Data Sources (Multidimensional Datasets): Eurostat Database.]
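
Three of the OLAP operations named on this slide can be sketched over a tiny in-memory dataset. This is an illustrative simplification, not how OLAP4LD implements them; the `facts` list and the functions `slice_`, `dice` and `roll_up` are hypothetical.

```python
from collections import defaultdict

# Hypothetical facts with two dimensions (year, country) and one measure.
facts = [
    {"year": 2013, "country": "DE", "value": 1.0},
    {"year": 2013, "country": "FR", "value": 2.0},
    {"year": 2014, "country": "DE", "value": 3.0},
    {"year": 2014, "country": "FR", "value": 4.0},
]

def slice_(facts, dimension, member):
    """Slice: fix one dimension to a single member."""
    return [f for f in facts if f[dimension] == member]

def dice(facts, **allowed):
    """Dice: keep facts whose members lie in the allowed sets."""
    return [
        f for f in facts
        if all(f[dim] in members for dim, members in allowed.items())
    ]

def roll_up(facts, keep):
    """Roll-up: sum the measure over all dimensions not listed in `keep`."""
    totals = defaultdict(float)
    for f in facts:
        totals[tuple(f[dim] for dim in keep)] += f["value"]
    return dict(totals)

by_year = roll_up(facts, keep=["year"])
germany = slice_(facts, "country", "DE")
recent = dice(facts, year={2014}, country={"DE", "FR"})
```

Each operation returns plain Python data, so the results can be chained, e.g. a roll-up over a slice.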
References & Further Reading
[Chaudhuri, 97] Chaudhuri, S., & Dayal, U. (1997). An overview of data
warehousing and OLAP technology. ACM SIGMOD Record, 26(1), 65–74.
doi:10.1145/248603.248616
[Ciferri, 2012] Ciferri, C., Ciferri, R., Gómez, L., & Schneider, M. (2012). Cube
Algebra: A Generic User-Centric Model and Query Language for OLAP Cubes.
IJDWM.
[Fayyad, 1996] Fayyad, U., & Piatetsky-Shapiro, G. (1996). From data mining
to knowledge discovery in databases. AI Magazine, 17, 37.
[Golfarelli, 2009] Golfarelli, M., & Rizzi, S. (2009). Data Warehouse Design: Modern
Principles and Methodologies. McGraw-Hill Professional.
[Gómez, 2012] Gómez, L. I., Gómez, S. A., & Vaisman, A. A. (2012). A Generic
Data Model and Query Language for Spatiotemporal OLAP Cube Analysis. In
EDBT 2012.
[Gray, 1995] Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1995). Data
cube: a relational aggregation operator generalizing GROUP-BY, CROSS-TAB,
and SUB-TOTALS. Proceedings of the Twelfth International Conference
on Data Engineering, 152–159. doi:10.1109/ICDE.1996.492099

References & Further Reading (2)
[Inmon, 1996] Inmon, W. H. (1996). Building the Data Warehouse (2nd edition).
John Wiley & Sons, Inc.
[Kimball, 2002] Kimball, R., & Ross, M. (2002). The data warehouse toolkit: the
complete guide to dimensional modelling (reprint). New York: Wiley, 1–447.
doi:10.1145/945721.945741
[Köppen, 2012] Köppen, V., Saake, G., & Sattler, K. (2012). Data Warehouse
Technologien.
[Pendse, 95] Pendse, N.: The FASMI Definition for OLAP. Business
Intelligence, August 1995.
[Stolte, 2002] Stolte, C., Tang, D., & Hanrahan, P. (2002). Query, analysis, and
visualization of hierarchically structured data using Polaris. Proceedings of the
eighth ACM SIGKDD international conference on Knowledge discovery and data
mining - KDD ’02, 112. doi:10.1145/775063.775064
[Shneiderman, 1996] Shneiderman, B. (1996). The Eyes Have It: A Task by
Data Type Taxonomy for Information Visualizations. Information Visualization,
336–343.
Benedikt Kämpgen, Andreas Harth. OLAP4LD - A Framework for Building
Analysis Applications over Governmental Statistics. ESWC 2014 Posters &
Demo session, Springer, May 2014.
