Data Warehousing Interview Questions

15.10.
2022, 11:58 Data Warehousing Interview Questions | by Shobha Bhagwat | Medium
To make Medium work, we log user data. Get started

Open in app
By using Medium, you agree to our
Privacy Policy, including cookie policy.
Shobha Bhagwat Follow
Dec 31, 2021 · 8 min read · Listen
Save
Data Warehousing Interview Questions

DWH topics to prepare for Data Engineering Interviews
https://www.researchgate.net/figure/Data-warehouse-architecture_fig1_275068752
I worked as a data engineer for almost 8 years and during that span, attended
interviews where I was asked many data warehousing (DWH) concepts. Hence I
thought to jot down some of the DWH concepts that are important from data
engineering (DE) interview perspective, based on my interview experience in the
past. To aid in DE interview prep, I have listed few high difficulty SQL questions in
my previous post.
https://shobha-bhagwat.medium.com/data-warehousing-interview-questions-ab96c242f19c 1/9
15.10.2022, 11:58 Data Warehousing Interview Questions | by Shobha Bhagwat | Medium
A. Data Warehousing definition and characteristics —

Open in app
A DWH is a copy of transactional data you
By using Medium, specifically
agree to our structured for query and analysis.
Another definition is thatPrivacy

a DWH Policy, including cookie policy.
is a subject-oriented, integrated, time-variant and
non-volatile collection of data to support management’s decision making process.
1. Subject-oriented — A DWH can be used to analyze a particular subject area (E.g.,

sales)
2. Integrated — DWH integrates data from different sources
3. Time variant — Historical data is stored in DWH as against transactional systems

where only most recent data is kept.
4. Non-volatile — Once data is in DWH, it can’t change
DWH is useful for reporting, analysis, data mining, exploration of historical data
etc. A datamart is a subset of a data warehouse focused on a particular line of
business, department, or subject area.
B. Approaches for implementing DWH —

1. Top-down approach — In this type of implementation, data from various
sources is loaded into DWH. After that subsets of data are extracted and loaded
into respective datamarts. The advantage of this approach is that the DWH is
single source of truth for all datamarts and thus it is easier to implement
consistency & standardization. Also it is easy to build new datamarts quickly
from the DWH for specific use cases. The downside of this approach is that it
may cost more and take more time for implementation than the other
approaches, since the first step involves setting up and building the entire DWH.
2. Bottom-up approach — In this approach, data from various sources is first

loaded into specific datamarts. Data from various datamarts is then combined
using conformed dimensions (i.e. dimensions which are shared by many facts
in the different datamarts). The advantage is the speed of delivery since the
DWH can be quickly integrated once the datamarts are built. The disadvantage
of this approach is that all datamarts should use standard dimensions and facts
to ensure integration and deliver a single version of the truth.
3. Hybrid approach — this approach is a blend of top-down & bottom-up

approaches and tries to capture both their advantages. The DWH is initially
designed in third normal form and the first few important datamarts are
designed in third normal form and loaded with data. Once these datamarts are
To make Medium work, we log user data.
Open in app Get started
loaded and data is approved by business,
By using Medium, you agreethe DWH is backfilled from these
to our
datamarts thus reducing implementation time.
4. Federated approach — this approach consists of integrating various DWH and

datamarts to immediately meet ever-changing business needs, without any
specific emphasis on any standard structure.
C. Types of DWH —
1. Global warehouse (Enterprise DWH) — DWH is centralized and designed for
entire enterprise. This type of DWH is time-consuming and costly to implement.
2. Independent datamart architecture — stand-alone datamarts, that are located

under one infrastructure (database), are controlled by different teams/work
groups and built solely for their needs.
3. Inter-connected DWH-datamart architecture — data marts are interconnected to

provide enterprise-wide view.
D. Facts —
Facts are quantitative measures, metrics or facts of the business (e.g. — sales,
revenue etc.). The different types of facts stored in DWH are —
1. Additive — facts that can be summed up through all the dimensions in the fact
table (e.g. sales fact)
2. Semi-additive — facts that can be summed up for some of the dimensions in the
fact table but not for others (e.g. daily balances fact can be summed up through
customer dimension but not through time dimension)
3. Non-additive — facts that can’t be summed up for any of the dimensions present
in the fact table (e.g. facts which are in ratios, percentage form)
4. Factless fact table — fact table that contains no measure or facts but can be used
to get aggregated counts. (e.g. fact table containing only sale_date and
product_key doesn’t contain any measure per se but can be used to get count of
products sold over various periods of time).
Fact tables that contain aggregated facts are called summary tables.
E. Dimensions —
Dimensions are attributes about facts. The various types of dimensions are-
1. Conformed dimension — dimension
Privacy which
Policy, including cookiemeans
policy. the same thing with every
fact across the DWH (e.g. customer_id, load_date)
2. Junk dimension — collection of random attributes that are unrelated to any

particular fact. (e.g. a junk dimension table that has all possible combinations of
age & marital status with a code associated with each combination. Thus instead
of two columns only one column will be required to convey the same
information)
3. Degenerated dimension — dimension which is derived from a fact table and

doesn’t have its own dimension table. (e.g. table containing only invoice number
or ticket number etc.)
4. Role-playing dimension — dimension which is often used for multiple purpose

within same database (e.g. date dimension can be used as both trade date as
well as settlement date)
5. Inferred dimension — while loading a fact record, the corresponding dimension

record may not be ready. One solution is to generate a surrogate dimension key
with null for all other attributes. This is called inferred dimension.
6. Static dimension — static dimension is not extracted from original data source
but is created within the context of the DWH and can be loaded manually. (e.g.
status codes)
7. Shrunk dimension — shrunk dimension is a subset of a dimension’s attributes

that apply to a higher level of summary. (e.g. — a month dimension would be a
shrunk dimension for date dimension)
Slowly changing dimensions (SCD) store both current and historical values for an
attribute. Since the dimensions don’t change frequently (as compared to facts), SCD
implementation deals with how history is maintained for dimensions (e.g. change
in marital status). The various types of SCD are —
1. Type I — No history is maintained, new information overwrites existing

information
2. Type II — New record is inserted for every changed dimension value

Open in app
One way to implement Type II SCD is to have start and end dates to indicate which is the latest value
Another way to implement Type II SCD is to have a flag to indicate which is the latest value
3. Type III — Original record is modified to include a new column which contains
previous value. Only current and previous value is maintained.
Only current and previous value (limited history) is maintained in SCD Type III
F. Normalization —
Database normalization is the process of structuring a database in accordance with
a series of rules in order to reduce data redundancy and improve data integrity. The
various normal forms are —
1. One normal form (1NF) — Eliminate repeated groups. Make a separate table for
each set of related attributes and give each table a primary key (PK). Each field
contains at most one value.
Original data

Open in app
After 1NF normalization
2. Two normal form (2NF) — Eliminate redundant data. If an attribute depends on

only part of a multi-valued key, remove it to a separate table.
Original table
Two tables created post 2NF normalization
3. Three normal form (3NF) — Eliminate columns not dependent on the primary
key.
Original table structure
Table structure after 3NF normalization
Transitive functional dependency — When changing a non-key column might cause

any of the other non-key By
column to change.
using Medium, (e.g.
you agree changing the name in a table might
to our
change the salutation column value)
G. DWH Modelling Schemas —

Schemas are logical representation of how facts and dimensions are linked in the
DWH.
1. Star schema — It is called a star schema because the entity-relationship diagram

of this schema resembles a star, with points radiating from a central fact table.
The fact table is connected to multiple dimension tables through primary key-
foreign key relationship. The dimension tables are not connected to each other.
This type of architecture is more suited for metrics/facts-heavy analysis. This
schema works mainly for de-normalized data.
2. Snowflake schema — This is an extension of star schema where the dimension

tables are normalized into multiple tables. Thus there is relationship between
the dimension tables. Hence data retrieval is slower. This structure is more
suited for dimension-heavy analysis.
3. Galaxy schema — This is a combination of star and snowflake schema. Here

many facts and some common dimensions are linked together.
H. OLAP vs OLTP systems —

Online transaction processing system (OLTP) processes and stores data from
transactions in real time. The tables here don’t maintain history, are smaller in size
and are highly normalized for quick insert, delete and update operations. These are
the backend tables of online applications.
Online Analytical Processing (OLAP) systems are de-normalized and process data in
batches. They store history and are used to run complex queries for reporting and
analysis. The tables here are de-normalized to support complex querying and data
analysis.
Another common question encountered in interviews is designing of data marts/

DWH for specific use cases. I was once asked to design the data model for a redBus
kind of bus seat booking To

system, in a FAANG interview. How to solve these type of
make Medium work, we log user data.
questions is detailed in this medium
By using article
Medium, I found
you agree to our online.
Some of the other questions that I have encountered are —
1. how to load incremental data (maintaining time-based watermarks, checking

source data changes and calculating incremental delta for loading, etc.),
2. how to load large files into a DWH through batch system (common data
cleaning steps, checking data integrity, null value handling, date formatting,
staging data and then moving to master table based on specific DML strategy
and maintaining PK- FK relationship while loading data)
3. various ETL strategies (for source file placed at ftp server, sourcing data from
transactional tables without locking, duplicate data handling etc.)
4. various data archival and staging strategies for different use cases
76 3
5. how to handle DDL operations while maintaining model integrity
There are many more questions and topics under data warehousing and data
modelling. I have covered only those ones which have been asked to me over the
years. Hope this is helpful!
About Help Terms Privacy
Get the Medium app

Data Warehousing Interview Questions - by Shobha Bhagwat - Medium

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Warehousing Interview Questions - by Shobha Bhagwat - Medium

Uploaded by

Copyright:

Available Formats

15.10.

2022, 11:58 Data Warehousing Interview Questions | by Shobha Bhagwat | Medium

To make Medium work, we log user data. Get started

Shobha Bhagwat Follow

Dec 31, 2021 · 8 min read · Listen

A. Data Warehousing definition and characteristics —

Another definition is thatPrivacy

1. Subject-oriented — A DWH can be used to analyze a particular subject area (E.g.,

2. Integrated — DWH integrates data from different sources

3. Time variant — Historical data is stored in DWH as against transactional systems

4. Non-volatile — Once data is in DWH, it can’t change

B. Approaches for implementing DWH —

2. Bottom-up approach — In this approach, data from various sources is first

3. Hybrid approach — this approach is a blend of top-down & bottom-up

4. Federated approach — this approach consists of integrating various DWH and

2. Independent datamart architecture — stand-alone datamarts, that are located

3. Inter-connected DWH-datamart architecture — data marts are interconnected to

fact across the DWH (e.g. customer_id, load_date)

2. Junk dimension — collection of random attributes that are unrelated to any

3. Degenerated dimension — dimension which is derived from a fact table and

4. Role-playing dimension — dimension which is often used for multiple purpose

5. Inferred dimension — while loading a fact record, the corresponding dimension

7. Shrunk dimension — shrunk dimension is a subset of a dimension’s attributes

1. Type I — No history is maintained, new information overwrites existing

2. Type II — New record is inserted for every changed dimension value

To make Medium work, we log user data. Get started

To make Medium work, we log user data. Get started

After 1NF normalization

2. Two normal form (2NF) — Eliminate redundant data. If an attribute depends on

Two tables created post 2NF normalization

Original table structure

Table structure after 3NF normalization

Transitive functional dependency — When changing a non-key column might cause

G. DWH Modelling Schemas —

1. Star schema — It is called a star schema because the entity-relationship diagram

2. Snowflake schema — This is an extension of star schema where the dimension

3. Galaxy schema — This is a combination of star and snowflake schema. Here

H. OLAP vs OLTP systems —

Another common question encountered in interviews is designing of data marts/

kind of bus seat booking To

Some of the other questions that I have encountered are —

1. how to load incremental data (maintaining time-based watermarks, checking

About Help Terms Privacy

Get the Medium app

You might also like