
Data Warehousing and

Business Intelligence
Lecturer: Jiaheng Lu

Department of Computer Science


University of Helsinki

Outline
• About the course
• Databases and the limits with OLTP databases
• What is a Data Warehouse?
• Components of a Data Warehouse
• ETL phases

13.3.2024 2
Main topics of this course

• This course will cover selected topics in databases and big data, including:
• Data warehousing
• OLAP and Business intelligence
• NoSQL databases
• Multi-model databases

Prerequisite course
• Introduction to databases
• Bachelor-level
• We assume that you are familiar with:
• Formulating SQL queries
• Transactions and ACID properties
• Hands-on experience running SQL queries against databases
Schedule of the course 2024
Week   | Wednesday                              | Friday
Week 1 | Lecture 1: Data warehousing            | Reading paper together
Week 2 | Lecture 2: OLAP and BI (I)             | Tutorial 1
Week 3 | Lecture 3: OLAP and BI (II)            | Tutorial 2
Week 4 | Lecture 4: Big data and data warehouse | Tutorial 3
Week 5 | Lecture 5: NoSQL databases             | Lecture 6: Multi-model databases
Week 6 | Question and Answer (QA) session       | Student presentation (26.04)
Week 7 | Holiday                                | Student presentation (03.05)

Attendance
• To pass this course, you must submit answers to the four exercises and give an online presentation.

• Attending the reading-paper session and answering its questions is also compulsory; attendance at all other sessions is optional.

• No final examination
Grading
Part                     | Points (up to)
Four exercises           | 69
Reading-paper            | 3
Presentation             | 25
Feedback on presentation | 3

All exercises will be published in Moodle.


Exercise 1 is out.

Grading of this course

• Score: Grade


• <51 Abandoned
• 51-60 1
• 61-70 2
• 71-80 3
• 81-90 4
• 91-100 5

Textbook (1)
• Fundamentals of Database Systems
• Ramez Elmasri, Shamkant B. Navathe
• Seventh edition, Global edition, 2017
• This book is available online in our university library
• Chapters 24, 25, 29

Textbook (2)
• Principles of Database Management: The Practical Guide to Storing,
Managing and Analyzing Big and Small Data

• Part 4: Data Warehousing, Data Governance and (Big) Data Analytics

• https://www.pdbmbook.com/
Recommended books and links
• “The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling”. John Wiley & Sons, 2013.
→ Authors: Ralph Kimball and Margy Ross.

• “Building the Data Warehouse”. John Wiley & Sons, 2005.
→ Author: William H. Inmon

Useful links:

https://www.1keydata.com/datawarehousing/datawarehouse.html
Learning objectives: Week 1
• Understand the limits of OLTP databases
• Can explain the four characteristics of a data warehouse
• Know the main components of a data warehouse
• Can compare related terms, including standard DB, data warehousing, OLTP, and heterogeneous databases
• Understand the ETL phases and their main issues
Learning objectives: Week 2
• Understand the star schema, snowflake schema, fact constellations
• Can formulate OLAP operations: drilling, rolling, slicing, dicing and
pivoting
• Understand MOLAP, ROLAP and HOLAP
Learning objectives: Week 3
• Advanced SQL queries for data analytics
• Understand Business intelligence (BI) and its applications
• Understand independent and dependent data marts
Learning objectives: Week 4
• Understand the difference between the Inmon and Kimball approaches to data warehousing
• Know the six V’s of big data
• Understand the Lambda and Kappa big data architectures
• Know the main products for data warehousing and BI
• Understand the difference between a real-time data warehouse and a traditional data warehouse
Learning objectives: Week 5
• Know the difference between ACID and BASE
• Understand the CAP theorem
• Understand various data models and their operations, including relational, semi-structured, and graph data models
• Know the four types of NoSQL database (key-value, document, wide-column, and graph) and their different data storage approaches
Learning objectives: Week 6
• Know the motivation for multi-model databases and polystores
• Understand the current approaches for multi-model data storage and querying
• Understand a unified categorical model for multi-model databases
AI tools in the course
• We follow the university-level guidelines.
• In particular, if you use a language model to produce the work you are
returning, you must report in writing which model (e.g. ChatGPT, Bing AI) you
have used and in what way.
• Failing to report the use of a language model as instructed is treated as
cheating.
• Watch an introductory video about data warehouses

• https://www.youtube.com/watch?v=AHR_7jFCMeY
Outline
• About the course
• Databases and the limits with OLTP databases
• What is a Data Warehouse?
• Components of a Data Warehouse

What is a Database?

• A database is a collection of related data.

• For example: names, telephone numbers, and addresses of people.

• This collection of data is stored on a hard drive and managed by a database system.

A database is more than a random collection of data
• A database represents some aspect of the real world (not random
data). Changes to the world are reflected in the database.

• A database is a logically coherent collection of data.

DBMS and OLTP
• A DBMS (database management system) is general-purpose software that facilitates the process of defining, constructing, manipulating, and sharing databases among users and applications.

• OLTP (Online Transaction Processing) is a category of data processing that is focused on transaction-oriented tasks. OLTP typically involves inserting, updating, and/or deleting small amounts of data in a database.
• Examples of OLTP transactions include online banking, purchasing a book online, booking an airline ticket, and order entry.
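A minimal sketch of an OLTP-style transaction, using Python's built-in sqlite3 module. The `account` table, the balances, and the `transfer` helper are hypothetical names chosen for illustration; the point is only that a transaction touches a few rows and either completes fully or not at all.

```python
import sqlite3

# Hypothetical schema: a tiny bank with two accounts (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit above
            # is rolled back as well.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, 1, 2, 30)       # small update of two rows
failed = transfer(conn, 1, 2, 500)  # overdraft: the whole transfer is undone
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

Each call reads and writes only two rows, which is the access pattern OLTP systems are tuned for.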

The limits with OLTP databases
• Operational (OLTP) databases are designed to record transactions from daily operations. They are optimized to create or update individual records efficiently.
• Limits:
• Transactional systems were not designed for decision-support analysis.
• Data on transactional systems changes constantly, so OLTP databases lack historical data.
What is Data Warehousing?
• Data warehousing is an architectural model designed to gather data
from various sources into a single unified data model for analysis
purposes.
• The term was introduced in 1990 by William Inmon
• In a data warehouse, the data is:
• Subject-Oriented
• Integrated
• Time-Variant
• Non-Volatile
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product,
sales.
• Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
• Provide a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process.

• Image link: https://handbook.magestore.com/books/data-warehouse---tutorial/page/data-warehouse-tutorial
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data
sources
• relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
• Ensure consistency in naming conventions, encoding
structures, attributes, etc. among different data sources
• E.g., currency, tax, etc.
• When data is moved to the warehouse, it is converted.
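As a rough sketch of what "converted when moved to the warehouse" can mean, the snippet below maps rows from two sources onto one naming convention, one gender encoding, and one currency. The source layouts, the `to_warehouse_row` helper, and the exchange rate are all made up for illustration.

```python
from decimal import Decimal

# Two hypothetical source systems with different field names,
# encodings, and currencies.
shop_rows = [{"cust": "C1", "gender": "M", "amount": "100.00", "currency": "EUR"}]
web_rows = [{"customer_id": "C2", "sex": "female", "amount": "50.00", "currency": "USD"}]

# Made-up conversion rate, for illustration only.
RATES_TO_EUR = {"EUR": Decimal("1"), "USD": Decimal("0.90")}
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def to_warehouse_row(customer, gender, amount, currency):
    """Convert one source row to the warehouse's single convention."""
    return {
        "customer_id": customer,
        "gender": GENDER_CODES[gender.strip().lower()],
        "amount_eur": Decimal(amount) * RATES_TO_EUR[currency],
    }

integrated = (
    [to_warehouse_row(r["cust"], r["gender"], r["amount"], r["currency"])
     for r in shop_rows]
    + [to_warehouse_row(r["customer_id"], r["sex"], r["amount"], r["currency"])
       for r in web_rows]
)
```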

Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly longer than that of
operational systems.
• OLTP database: current value data.
• Data warehouse data: provide information from a historical perspective
(e.g., past 5-10 years)
• Contains an element of time, explicitly or implicitly
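The contrast between "current value" and "historical perspective" can be sketched as follows. The price data and the helper names `change_price` and `price_as_of` are hypothetical; the operational store overwrites its value, while the warehouse appends a dated row.

```python
from datetime import date

oltp_price = {"book-1": 20}      # operational store: current value only
warehouse_price_history = []     # warehouse: one row per (item, valid_from, price)

def change_price(item, new_price, on):
    oltp_price[item] = new_price                            # overwrite in place
    warehouse_price_history.append((item, on, new_price))   # append a dated row

change_price("book-1", 22, date(2023, 1, 1))
change_price("book-1", 25, date(2024, 1, 1))

def price_as_of(item, day):
    """Return the most recent recorded price on or before `day`, if any."""
    rows = [(d, p) for (i, d, p) in warehouse_price_history if i == item and d <= day]
    return max(rows)[1] if rows else None
```

Only the history table can answer a question such as "what was the price in mid-2023?"; the operational store knows just the latest value.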

Data Warehouse—Non-Volatile
• A physically separate store of data transformed from the
operational environment.
• Operational update of data does not occur in the data warehouse
environment.
• Does not require transaction processing, recovery, and
concurrency control mechanisms
• Requires only two data operations:
• the initial loading of data and read access to data.
Data Warehouse vs. Heterogeneous DBMS
• Traditional heterogeneous DB integration:
• Build wrappers/mediators on top of heterogeneous databases
• Query driven approach
• When a query is posed to a client site, a meta-dictionary is used to
translate the query into queries appropriate for individual heterogeneous
sites involved, and the results are integrated into a global answer set.
• Data warehouse: high performance
• Information from heterogeneous sources is integrated in advance and stored
in warehouses for direct query and analysis. No wrapper/mediators.
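The two approaches can be contrasted in a small sketch. The two sources, their layouts, and the function names are hypothetical: the query-driven mediator calls every wrapper each time a query arrives, whereas the warehouse integrates the sources once, in advance, and answers queries from its own copy.

```python
# Two hypothetical sources with different schemas.
helsinki_db = [{"name": "Aino", "city": "Helsinki", "total_eur": 120}]
tampere_file = [("Ville", "Tampere", 80), ("Aino", "Helsinki", 30)]

# Wrappers translate each source into a common (name, city, total) form.
def wrap_db():
    return [(r["name"], r["city"], r["total_eur"]) for r in helsinki_db]

def wrap_file():
    return list(tampere_file)

WRAPPERS = (wrap_db, wrap_file)

def mediator_query(city):
    """Query-driven: contact every source at query time and merge the answers."""
    return [row for wrapper in WRAPPERS for row in wrapper() if row[1] == city]

# Warehouse approach: integrate in advance; queries never touch the sources.
warehouse = [row for wrapper in WRAPPERS for row in wrapper()]

def warehouse_query(city):
    return [row for row in warehouse if row[1] == city]
```

Both functions return the same rows here; the difference is when the integration work is paid for, and that the warehouse copy can go stale until the next load.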

Heterogeneous DBMS

Figure source link: https://www.researchgate.net/figure/the-IGN-E-case-of-heterogeneous-databases_fig8_226823712
Data Warehousing
• Not a product; it is a process
• Combination of hardware and software
• Can often be set up as one VLDB (Very Large Database) or a collection
of subject areas called Data Marts.

Image link: https://corporatefinanceinstitute.com/resources/knowledge/other/data-warehousing/


• Answer some questions online:

• https://pollev.com/jiahenglu471
Components of a Data Warehouse

Components:
• Hardware
• Database Management System
• Front-End Access Tools and other tools
Components of a Data Warehouse - Hardware

• Power: number of processors, memory, I/O bandwidth
• Availability: redundant equipment
• Disk storage: speed, and enough capacity for the loaded data set
• Backup solution: automated, and able to support incremental backups and archiving of older data
Components of a Data Warehouse - DBMS

• Physical storage capacity of the DBMS
• Loading, indexing, and processing speed
• Ability to handle your data needs
• Operational integrity, reliability, and manageability
Components of a Data Warehouse - Front End & Other Tools
• Query Tools (SQL & GUI based)
• Report Writers
• Metadata Repositories
• OLAP (Online Analytical Processing)
• Data Mining Products
Metadata Repositories

Metadata is Data about Data. Users and developers often need a way to
find information on the data they use.
Information can include:
• Source System(s) of the Data, contact information
• Related tables or subject areas
• Programs or Processes which use the data
• Population rules (Update or Insert and how often)
• Status of the Data Warehouse’s processing and condition
• ……
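One possible way to picture a metadata repository entry is a small record per warehouse table. The `TableMetadata` class, its field names, and the sample values below are assumptions for illustration, not a standard metadata schema.

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    """Hypothetical repository entry describing one warehouse table."""
    table: str
    source_systems: list
    contact: str
    related_tables: list = field(default_factory=list)
    used_by: list = field(default_factory=list)   # programs or processes
    population_rule: str = "insert"               # "insert" or "update"
    refresh: str = "daily"                        # how often it is populated
    last_load_status: str = "unknown"

repository = {}

def register(meta):
    repository[meta.table] = meta

def tables_from(source):
    """Answer a typical user question: which tables come from this source?"""
    return sorted(t for t, m in repository.items() if source in m.source_systems)

register(TableMetadata("fact_sales", ["order_entry"], "dw-team@example.org",
                       related_tables=["dim_customer"], used_by=["monthly_report"]))
register(TableMetadata("dim_customer", ["crm", "order_entry"], "dw-team@example.org"))
```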
Data warehouse metadata

Source: https://link.springer.com/referenceworkentry/10.1007/978-0-387-39940-9_912
Data Mining

• Analyzes large amounts of data (usually contained in a Data Warehouse) and looks for trends in the data
• Technology, enhanced with machine learning techniques, now allows us to do this better than in the past.
Key Data Mining Techniques
• Clustering.
• Association.
• Classification.
• Machine Learning.
• Prediction.
• Deep Neural Networks, etc.
OLTP vs. Data Warehousing
• Organized by transaction vs. organized by subject
• More users vs. fewer users
• Accesses a few records vs. scans entire tables
• Smaller database vs. larger database
• Continuous updates vs. periodic updates
Data Warehouse vs. standard DB
Standard DB                | Data Warehouse
Mostly updates             | Mostly reads
Many small transactions    | Queries are long and complex
MB - GB of data            | GB - TB of data
Current snapshot           | History
Index/hash on primary keys | Lots of scans
Raw data                   | Summarized, reconciled data
Thousands of users         | Hundreds of users (e.g., decision-makers, analysts)
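The difference in access pattern can be sketched with two queries over the same table, here in sqlite3. The `sales` table and its rows are hypothetical; the first query is a point lookup on the primary key, the second scans the table and summarizes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, product TEXT, year INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO sales (product, year, amount) VALUES (?, ?, ?)",
    [("book", 2022, 10), ("book", 2023, 15), ("pen", 2022, 2),
     ("pen", 2023, 3), ("book", 2023, 5)],
)

# Standard-DB style: fetch one record through the primary-key index.
one_order = conn.execute(
    "SELECT product, amount FROM sales WHERE id = ?", (2,)
).fetchone()

# Warehouse style: scan all rows and return summarized data.
summary = conn.execute(
    "SELECT product, year, SUM(amount) FROM sales "
    "GROUP BY product, year ORDER BY product, year"
).fetchall()
```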
• Read slides from page 46 to 55, and answer questions.
ETL phases

Three steps:

1. Extraction Phase: Get the data

2. Transformation Phase: Make it useful

3. Loading Phase: Save it to the warehouse
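The three steps can be sketched as three small functions. This is a toy pipeline under stated assumptions: the CSV export `RAW_EXPORT`, the `sales` table, and the function names are invented for illustration, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical export file from a source system (note the stray spaces).
RAW_EXPORT = "customer,amount\n C1 ,20\nC2,5\n"

def extract(text):
    """Extraction: read the exported file into a staging list."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(staged):
    """Transformation: trim whitespace and cast amounts to integers."""
    return [(row["customer"].strip(), int(row["amount"])) for row in staged]

def load(conn, rows):
    """Loading: write the cleaned rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount INTEGER)")
    with conn:  # commit the inserted rows as one unit
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(RAW_EXPORT)))
loaded = warehouse.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
```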


ETL (1)

Extraction Phase:
• Source systems export data via files, or populate the warehouse directly when the databases can “talk” to each other
• The data is transferred to the Data Warehouse server and placed in a staging area
Issues:
• The warehouse uses a relational or multi-dimensional data model (e.g., data cube)
• The sources, on the other hand, may use different data models:
• Relational, hierarchical, graph
• How do we get the data out?
• We will discuss this together with multi-model databases in this course.
ETL (2)

Transformation Phase:
• Takes data and turns it into a form that is suitable for insertion into the warehouse
• Combines related data
• Removes redundancies
• Uses common codes (e.g., Commercial Customer)
• Corrects spelling mistakes
• Enforces consistency (e.g., PA, Pa, Penna, Pennsylvania)
• Standardizes formatting (e.g., addresses)
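A minimal sketch of two of these transformations, consistency of codes and removal of redundancies, using the Pennsylvania example from the list. The lookup table `STATE_CODES` and the helper names are assumptions; a real pipeline would use a complete reference table.

```python
# Assumed lookup table: every known spelling maps to one common code.
STATE_CODES = {"pa": "PA", "penna": "PA", "pennsylvania": "PA"}

def normalize_state(value):
    """Map a state spelling to its common code; unknown values are upper-cased."""
    key = value.strip().rstrip(".").lower()
    return STATE_CODES.get(key, value.strip().upper())

def transform(rows):
    """Standardize names and states, then drop rows that became identical."""
    seen, cleaned = set(), []
    for name, state in rows:
        record = (" ".join(name.split()).title(), normalize_state(state))
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned

cleaned = transform([
    ("jane  doe", "Pa"),
    ("Jane Doe", "Pennsylvania"),   # same person once the codes agree
    ("John Roe", "Penna"),
])
```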
ETL (3)

Loading Phase:
• Places the cleaned data into the DBMS in its final, usable form
• Compares data from the source systems and the Data Warehouse
• Documents the load information for the users
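The loading steps might look like the sketch below: insert the rows, compare the source row count with what arrived in the warehouse, and record the outcome in a log table. The table names `fact_sales` and `load_log` and the shape of the report are hypothetical; real loaders usually compare more than counts (for example, column totals).

```python
import sqlite3
from datetime import datetime, timezone

def load(conn, source_rows):
    """Insert rows, compare source and warehouse counts, and log the load."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (customer TEXT, amount INTEGER)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_log "
        "(loaded_at TEXT, source_rows INTEGER, loaded_rows INTEGER, consistent INTEGER)"
    )
    count_sql = "SELECT COUNT(*) FROM fact_sales"
    before = conn.execute(count_sql).fetchone()[0]
    with conn:  # the rows and the log entry commit together
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", source_rows)
        after = conn.execute(count_sql).fetchone()[0]
        report = {
            "loaded_at": datetime.now(timezone.utc).isoformat(),
            "source_rows": len(source_rows),
            "loaded_rows": after - before,
        }
        report["consistent"] = report["source_rows"] == report["loaded_rows"]
        conn.execute(
            "INSERT INTO load_log VALUES (?, ?, ?, ?)",
            (report["loaded_at"], report["source_rows"],
             report["loaded_rows"], int(report["consistent"])),
        )
    return report

conn = sqlite3.connect(":memory:")
report = load(conn, [("C1", 20), ("C2", 5)])
log_entries = conn.execute("SELECT COUNT(*) FROM load_log").fetchone()[0]
```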
Example ETL Process
[Figure: ETL data flow. Inputs: item records, customer records. Operators: split date-time, filter invalid (twice), join, filter non-match, group by customer. Outputs: invoice line items, customer balance. Rejected rows: invalid dates/times, invalid items, invalid customers.]

• This is an example of an e-commerce loading process
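One possible reading of this e-commerce flow, sketched in Python. The order of the steps (split date-time, filter invalid dates, filter invalid items, join with customers, filter non-matches, group by customer) is an assumption about the diagram, and all record layouts and sample values are invented.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical inputs.
item_records = [
    {"ts": "2024-03-13 10:15", "item": "book", "customer": "C1", "amount": 20},
    {"ts": "not a date",       "item": "pen",  "customer": "C1", "amount": 2},
    {"ts": "2024-03-13 11:00", "item": "",     "customer": "C2", "amount": 5},
    {"ts": "2024-03-14 09:30", "item": "pen",  "customer": "C9", "amount": 3},
    {"ts": "2024-03-14 12:00", "item": "pen",  "customer": "C1", "amount": 4},
]
customer_records = {"C1": "Aino", "C2": "Ville"}

def split_datetime(rec):
    """Return rec with separate date and time fields, or None if ts is invalid."""
    try:
        ts = datetime.strptime(rec["ts"], "%Y-%m-%d %H:%M")
    except ValueError:
        return None
    return {**rec, "date": ts.date().isoformat(),
            "time": ts.time().isoformat(timespec="minutes")}

invalid_dates, invalid_items, invalid_customers, line_items = [], [], [], []
for rec in item_records:
    split = split_datetime(rec)
    if split is None:
        invalid_dates.append(rec)                      # filter invalid dates/times
    elif not split["item"]:
        invalid_items.append(rec)                      # filter invalid items
    elif split["customer"] not in customer_records:
        invalid_customers.append(rec)                  # join found no match
    else:
        name = customer_records[split["customer"]]     # join with customer records
        line_items.append({**split, "customer_name": name})

# Group by customer to obtain the customer balance.
customer_balance = defaultdict(int)
for li in line_items:
    customer_balance[li["customer"]] += li["amount"]
```

Rejected rows are kept in separate lists rather than discarded, so they can be reported back to the sources.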


Data Monitors
• Goal: Detect changes of interest and propagate to users
• How?
• Triggers
• Compare query results
• Compare snapshots/dumps
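The snapshot-comparison option can be sketched as a diff of two dumps keyed by primary key. The function name `diff_snapshots` and the sample rows are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two snapshots that map primary key -> row.

    Returns (inserted, deleted, updated); `updated` maps each changed key
    to its (old_row, new_row) pair.
    """
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserted, deleted, updated

yesterday = {1: ("Jane", "PA"), 2: ("John", "NY")}
today = {1: ("Jane", "NJ"), 3: ("Ann", "PA")}
inserted, deleted, updated = diff_snapshots(yesterday, today)
```

Unlike triggers, this needs no cooperation from the source system, but it only detects changes between dumps and costs a full comparison each time.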
Data Integration
• Receive data (changes) from multiple wrappers and integrate into
warehouse
• Rule-based
• Actions
• Resolve inconsistencies
• Eliminate duplicates
• Summarize data
• etc.
Data Cleansing
• Find (& remove) duplicate tuples
• e.g., Jane Doe vs. Jane Q. Doe
• Detect inconsistent, wrong data
• Attribute values that don’t match
• Patch missing, unreadable data
• Notify sources of errors found
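A deliberately crude sketch of duplicate detection for the Jane Doe example: names are reduced to a (first name, last name) key with middle initials dropped, and rows sharing a key are flagged as candidate duplicates. The `name_key` heuristic is an assumption made for illustration; production cleansing tools use far more robust matching, and flagged pairs still need review before removal.

```python
import re

def name_key(name):
    """Reduce a name to (first, last), ignoring case, punctuation, and initials."""
    tokens = re.findall(r"[a-z]+", name.lower())
    tokens = [t for t in tokens if len(t) > 1]   # drop single-letter initials
    return (tokens[0], tokens[-1]) if tokens else ()

def find_duplicates(rows):
    """Group rows by name key and return the groups with more than one row."""
    groups = {}
    for row in rows:
        groups.setdefault(name_key(row["name"]), []).append(row)
    return [group for group in groups.values() if len(group) > 1]

customers = [
    {"id": 1, "name": "Jane Doe"},
    {"id": 2, "name": "Jane Q. Doe"},
    {"id": 3, "name": "John Roe"},
]
duplicates = find_duplicates(customers)
```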
Data cleansing example:

• Example link: https://quantdare.com/data-cleansing-and-transformation/


• Answer some questions online:

• https://pollev.com/jiahenglu471
Learning objectives of this lecture
• Understand the limits of OLTP databases
• Can explain the four characteristics of a Data Warehouse
• Know the main components of a Data Warehouse
• Understand the ETL phases and their main issues
• Can compare related terms, including standard DB, data warehousing, OLTP, and heterogeneous databases
Paper reading on Friday (3 points)
• An Overview of Data Warehousing and OLAP Technology

◼ Please read the paper and the questions before attending the session. Students will be divided into groups for discussion and will then come back together to answer the assigned questions.
◼ Please submit your answers to the assigned questions in Moodle to receive 3 points.
Homework
• Read textbook 1, Chapter 29

• Read textbook 2, Chapter 17

• Work on Exercise 1.
