New Questions
I am not looking for the definition, but the practical implementation, e.g. table structure, ETL/loading. {L}
Answer: Create the dimension table as normal, i.e. first the dim key column as an
integer, then the attributes as varchar (or varchar2 if you use Oracle). Then I'd create 3
additional columns: an IsCurrent flag, Valid From and Valid To (the last two are datetime
columns). With regards to the ETL, I'd first check whether the row already exists by comparing
the natural key. If it exists, expire the row and insert a new row. Set the Valid From
date to today's date or the current date time.
An experienced candidate (particularly a DW ETL developer) will not set the Valid From
date to the current date time, but to the time when the ETL started. This is so that all
the rows in the same load have the same Valid From, which is 1 millisecond after
the expiry time of the previous version, thus avoiding issues with ETL workflows that run
across midnight.
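The table structure and load logic described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not a production ETL. The table dim_customer, its columns, the single tracked attribute city, the HIGH_DATE constant and the functions make_dim and load_scd2 are all hypothetical names, and skipping unchanged rows is an added assumption. Valid From is set to the ETL start time, and the previous version is expired 1 millisecond earlier.

```python
import sqlite3
from datetime import datetime, timedelta

HIGH_DATE = "9999-12-31 00:00:00.000"  # assumed open-ended Valid To value


def fmt(ts):
    """Format a datetime with millisecond precision."""
    return ts.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]


def make_dim():
    """Create an in-memory SCD 2 dimension (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE dim_customer (
            customer_key INTEGER PRIMARY KEY,  -- surrogate dim key
            customer_id  TEXT NOT NULL,        -- natural key from the source
            city         TEXT,                 -- tracked attribute
            is_current   INTEGER NOT NULL,     -- IsCurrent flag
            valid_from   TEXT NOT NULL,
            valid_to     TEXT NOT NULL
        )
    """)
    return conn


def load_scd2(conn, source_rows, etl_start):
    """Apply one load; every new version gets Valid From = ETL start time."""
    valid_from = fmt(etl_start)
    expiry = fmt(etl_start - timedelta(milliseconds=1))
    for customer_id, city in source_rows:
        # Look up the current version by natural key.
        current = conn.execute(
            "SELECT customer_key, city FROM dim_customer "
            "WHERE customer_id = ? AND is_current = 1",
            (customer_id,),
        ).fetchone()
        if current is not None and current[1] == city:
            continue  # assumption: unchanged rows are left alone
        if current is not None:
            # Expire the existing version.
            conn.execute(
                "UPDATE dim_customer SET is_current = 0, valid_to = ? "
                "WHERE customer_key = ?",
                (expiry, current[0]),
            )
        # Insert the new current version.
        conn.execute(
            "INSERT INTO dim_customer "
            "(customer_id, city, is_current, valid_from, valid_to) "
            "VALUES (?, ?, 1, ?, ?)",
            (customer_id, city, valid_from, HIGH_DATE),
        )
    conn.commit()
```

Calling load_scd2 twice for the same natural key with a different city leaves two rows: the expired version and the current one.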
Purpose: SCD 2 is one of the first things that we learn in data warehousing. It is
considered basic/fundamental. The purpose of this question is to separate the
quality candidates from the ones who are bluffing. If the candidate cannot answer this
question you should worry.
2. Question: How do you index a fact table? And explain why. {H}
Answer: Index all the dim key columns, individually, nonclustered (SQL Server) or
bitmap (Oracle). The dim key columns are used to join to the dimension tables, so if they
are indexed the join will be faster. An exceptional candidate will suggest 3 additional
things: a) index the fact key separately, b) consider creating a covering index in the right
order on the combination of dim keys, and c) if the fact table is partitioned, the
partitioning key must be included in all indexes.
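As a rough illustration of which columns get indexed, here is a sketch using Python's built-in sqlite3 module. SQLite has no nonclustered/bitmap distinction and no table partitioning, so it only stands in for the SQL Server or Oracle DDL; point c) is not shown. The table fact_sales, its columns and the index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (
        fact_key     INTEGER NOT NULL,
        date_key     INTEGER NOT NULL,
        customer_key INTEGER NOT NULL,
        product_key  INTEGER NOT NULL,
        amount       REAL
    );

    -- One individual index per dim key column.
    CREATE INDEX ix_fact_sales_date     ON fact_sales (date_key);
    CREATE INDEX ix_fact_sales_customer ON fact_sales (customer_key);
    CREATE INDEX ix_fact_sales_product  ON fact_sales (product_key);

    -- a) The fact key indexed separately.
    CREATE UNIQUE INDEX ix_fact_sales_key ON fact_sales (fact_key);

    -- b) A composite index on the combination of dim keys; the column
    --    order here is an assumption and should follow the query pattern.
    CREATE INDEX ix_fact_sales_dims
        ON fact_sales (date_key, customer_key, product_key);
""")

# List the index names that now exist on the fact table.
index_names = {row[1] for row in conn.execute("PRAGMA index_list('fact_sales')")}
```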
Purpose: Many people know data warehousing only in theory. This question is designed
to separate those who have actually built a warehouse from those who haven't.
3. Question: In the source system, your customer record changes like this: customer1
and customer2 now become one company called customer99. Explain a) the impact on the
customer dim (SCD1), b) the impact on the fact tables. {M}
Answer: In the customer dim we update the customer1 row, changing it to customer99
(remember that it is SCD1). We do a soft delete on the customer2 row by updating the
IsActive flag column (a hard delete is not recommended). On the fact table we find the
Surrogate Keys for customer1 and customer2 and update them with customer99's SK.
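The three steps can be sketched with Python's built-in sqlite3 module. The tables dim_customer and fact_sales, their columns, the sample rows and the function merge_customers are hypothetical. Because the customer1 row is updated in place, it keeps its surrogate key, so in this sketch only the fact rows that pointed at customer2 need repointing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,     -- surrogate key
        customer_id   TEXT NOT NULL,           -- natural key
        customer_name TEXT NOT NULL,
        is_active     INTEGER NOT NULL DEFAULT 1
    );
    CREATE TABLE fact_sales (
        fact_key     INTEGER PRIMARY KEY,
        customer_key INTEGER NOT NULL,
        amount       REAL NOT NULL
    );
    INSERT INTO dim_customer VALUES
        (1, 'customer1', 'Customer 1', 1),
        (2, 'customer2', 'Customer 2', 1);
    INSERT INTO fact_sales VALUES
        (10, 1, 100.0), (11, 2, 50.0), (12, 1, 25.0);
""")


def merge_customers(conn, surviving_key, retired_key, new_id, new_name):
    """Merge two customers into one under SCD1, in a single transaction."""
    with conn:
        # SCD1: overwrite the surviving row in place.
        conn.execute(
            "UPDATE dim_customer SET customer_id = ?, customer_name = ? "
            "WHERE customer_key = ?",
            (new_id, new_name, surviving_key),
        )
        # Soft delete the retired row instead of removing it.
        conn.execute(
            "UPDATE dim_customer SET is_active = 0 WHERE customer_key = ?",
            (retired_key,),
        )
        # Repoint fact rows from the retired surrogate key.
        conn.execute(
            "UPDATE fact_sales SET customer_key = ? WHERE customer_key = ?",
            (surviving_key, retired_key),
        )


merge_customers(conn, 1, 2, "customer99", "Customer 99")
```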
Purpose: This is a common problem that everybody in data warehousing encounters. By
asking this question we will know if the candidate has enough experience in data
warehousing. If they have not come across this (probably they are new in DW), we want
to know if they have the capability to deal with it or not.
4. Question: What are the differences between the Kimball approach and Inmon's? Which
one is better and why? {L}
Answer: If you are interviewing for a junior role, e.g. developer, then the expected answer is:
in Kimball we do dimensional modelling, i.e. fact and dim tables, whereas in Inmon we do
CIF, i.e. an EDW in normalised form from which we then create a DM/DDS. Junior
candidates usually prefer Kimball, because of query performance and flexibility, or
because that's the only one they know, which is fine. But if you are interviewing for a
senior role, e.g. senior data architect, then they need to say that the approach depends
on the situation. Both Kimball's and Inmon's approaches have advantages and
disadvantages. Some of the main reasons for having a normalised DW can be found
here.
Purpose: a) to see if the candidate understands the core principles of data warehousing
or only knows the surface, b) to find out if the candidate is open minded, i.e. accepts that the
solution depends on what we are trying to achieve (there's no right or wrong answer), or if
they are blindly using Kimball for every situation.
5. Question: Suppose a fact row has unknown dim keys, do you load that row or not?
Can you explain the advantages/disadvantages? {M}
Answer: We need to load that row so that the total of the measure/fact is correct. To
enable us to load the row, we need to set the unknown dim key either to 0 or to the dim
key of the newly created dim row. We can also choose not to load that row (so the total of the
measure will be different from the source system) if the business requirements prefer it.
In this case we load the fact row to a quarantine area complete with error processing,
DQ indicator and audit log. On the next day, after we receive the dim row, we load the
fact row. This is commonly known as Late Arriving Dimension Rows and there are many
sources for further information; one of the best is Bob Becker's article here in 2006.
Others refer to this as Early Arriving Fact Row, which Ralph Kimball explained here in
2004.
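Two of the options from the answer (load with dim key 0, or quarantine) can be sketched with Python's built-in sqlite3 module. The tables dim_product, fact_sales and quarantine_sales, their columns, the reason text and the function load_fact are hypothetical; creating a new dim row on the fly and the full error processing and audit logging are left out.

```python
import sqlite3

UNKNOWN_KEY = 0  # dim key reserved for the "unknown" dim row

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_id  TEXT UNIQUE            -- natural key; NULL for the unknown row
    );
    CREATE TABLE fact_sales (
        product_key INTEGER NOT NULL,
        amount      REAL NOT NULL
    );
    CREATE TABLE quarantine_sales (
        product_id TEXT,
        amount     REAL,
        reason     TEXT
    );
    INSERT INTO dim_product VALUES (0, NULL), (1, 'P1');
""")


def load_fact(conn, source_rows, quarantine_unknown=False):
    """Load fact rows, handling natural keys that have no dim row yet."""
    for product_id, amount in source_rows:
        row = conn.execute(
            "SELECT product_key FROM dim_product WHERE product_id = ?",
            (product_id,),
        ).fetchone()
        if row is not None:
            conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (row[0], amount))
        elif quarantine_unknown:
            # Hold the row back; the fact total will differ from the source.
            conn.execute(
                "INSERT INTO quarantine_sales VALUES (?, ?, ?)",
                (product_id, amount, "dim row not found"),
            )
        else:
            # Load against the unknown key so the measure total stays correct.
            conn.execute(
                "INSERT INTO fact_sales VALUES (?, ?)", (UNKNOWN_KEY, amount)
            )
    conn.commit()
```

With quarantine_unknown left at False the fact total matches the source; with True the unmatched rows wait in quarantine_sales until the dim row arrives.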
Purpose: Again, this is a common problem that we encounter on a regular basis in data
warehousing. With this question we want to see if the candidate's experience level is up
to expectation or not.
6. Question: Please tell me about your experience on your last 3 data warehouse projects.
What were your roles in those projects? What were the issues and how did you solve
them? {L}
Answer: There's no wrong or right answer here. With this question you are looking for a)
whether they have done similar things to your current project, b) whether they have done
the same role as the role you are offering, c) whether they have faced the same issues as
your current DW project.
Purpose: Some of the reasons why we pay more to certain candidates compared to
others are: a) they have done it before, so they can deliver quicker than those who haven't,
b) they come from our competitors, so we would know what's happening there and we
can make a better system than theirs, c) they have solved similar issues, so we could
borrow their techniques.
But it is possible to join 2 fact tables using the common dim keys, although the performance
is usually horrible. For example: if FactTable1 has dim1key, dim2key, dim3key and
FactTable2 has dim1key and dim2key then join them like this:
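A minimal sketch of such a join (not the original listing), run through Python's built-in sqlite3 module. The measure columns measure1 and measure2 and the sample rows are hypothetical. Aggregating FactTable1 up to the (dim1key, dim2key) grain before joining is an assumption made here so that FactTable2 rows are not repeated once per dim3key value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FactTable1 (
        dim1key INTEGER, dim2key INTEGER, dim3key INTEGER, measure1 REAL
    );
    CREATE TABLE FactTable2 (
        dim1key INTEGER, dim2key INTEGER, measure2 REAL
    );
    INSERT INTO FactTable1 VALUES
        (1, 1, 1, 10.0), (1, 1, 2, 20.0), (2, 1, 1, 5.0);
    INSERT INTO FactTable2 VALUES
        (1, 1, 100.0), (2, 1, 50.0);
""")

JOIN_ON_DIM_KEYS = """
    SELECT f1.dim1key, f1.dim2key, f1.measure1, f2.measure2
    FROM (
        -- Bring FactTable1 to the grain of the common dim keys.
        SELECT dim1key, dim2key, SUM(measure1) AS measure1
        FROM FactTable1
        GROUP BY dim1key, dim2key
    ) AS f1
    JOIN FactTable2 AS f2
      ON f2.dim1key = f1.dim1key
     AND f2.dim2key = f1.dim2key
    ORDER BY f1.dim1key, f1.dim2key
"""

rows = conn.execute(JOIN_ON_DIM_KEYS).fetchall()
```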
So if we don't join 2 fact tables that way, how do we do it? The answer is the fact key column. It is good
practice (especially in SQL Server because of the concept of the clustered index) to have a fact key column to
enable us to identify rows on the fact table (see my article here). The performance would be better (than
joining on dim keys), but you need to plan this in advance as you need to include the fact key column on
the other fact table.
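A minimal sketch of the fact key approach (not the original listing), again through Python's built-in sqlite3 module. The column names factkey, fact1key, measure1 and measure2 and the sample rows are hypothetical; fact1key is the copy of FactTable1's fact key that has to be carried on FactTable2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FactTable1 (
        factkey INTEGER PRIMARY KEY,     -- fact key identifying each row
        dim1key INTEGER, dim2key INTEGER, dim3key INTEGER,
        measure1 REAL
    );
    CREATE TABLE FactTable2 (
        factkey  INTEGER PRIMARY KEY,
        fact1key INTEGER NOT NULL,       -- FactTable1's fact key, planned in advance
        dim1key INTEGER, dim2key INTEGER,
        measure2 REAL
    );
    INSERT INTO FactTable1 VALUES
        (1, 1, 1, 1, 10.0), (2, 1, 1, 2, 20.0);
    INSERT INTO FactTable2 VALUES
        (100, 1, 1, 1, 7.0), (101, 2, 1, 1, 3.0);
""")

JOIN_ON_FACT_KEY = """
    SELECT f1.factkey, f1.measure1, f2.measure2
    FROM FactTable2 AS f2
    JOIN FactTable1 AS f1
      ON f1.factkey = f2.fact1key
    ORDER BY f2.factkey
"""

rows = conn.execute(JOIN_ON_FACT_KEY).fetchall()
```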
I implemented this technique originally for self joining, but then expanded its use to joining to other fact tables.
But this must be used on an exception basis rather than as the norm.
Purpose: Not to trap the candidate, of course, but to see if they have experience dealing with a
problem which doesn't happen every day.
11. Question: How do you index a dimension table? {L}
Answer: Clustered index on the dim key, and nonclustered indexes (individual) on the attribute columns which
are used on the query where clause.
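A rough sketch with Python's built-in sqlite3 module, which only approximates the SQL Server terms: in SQLite a table declared with an INTEGER PRIMARY KEY is stored in key order, which plays the role of the clustered index here, and ordinary indexes stand in for nonclustered ones. The table dim_customer, its columns and the choice of city and segment as the filtered attributes are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,  -- dim key; the table is stored in this order
        customer_id  TEXT NOT NULL,
        city         TEXT,
        segment      TEXT
    );

    -- Individual indexes on attributes assumed to appear in WHERE clauses.
    CREATE INDEX ix_dim_customer_city    ON dim_customer (city);
    CREATE INDEX ix_dim_customer_segment ON dim_customer (segment);
""")

# List the secondary indexes that now exist on the dimension table.
dim_indexes = {row[1] for row in conn.execute("PRAGMA index_list('dim_customer')")}
```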
Purpose: This question is critical if you are looking for a Data Warehouse Architect (DWA) or a
Data Architect (DA). Many DWAs and DAs only know logical data modelling. Many of them don't know how to
index. They don't know how different the physical tables are in Oracle compared to Teradata. This
question is not essential if you are looking for a report or ETL developer. It's good for them to know, but
it's not essential.
12. Question: Tell me what you know about William Inmon? {L} Alternatively: Ralph Kimball.
Answer: He was the one who introduced the concept of data warehousing. Arguably Barry Devlin was the
first one, but he's not as popular as Inmon. If you ask who Barry Devlin is or who Claudia Imhoff is, 99.9%
of the candidates wouldn't know. But every decent practitioner in data warehousing should know about
Inmon and Kimball.
Purpose: To test if the candidate is a decent practitioner in data warehousing or not. You'll be surprised
(especially if you are interviewing a report developer) how many candidates don't know the answer. If
someone is applying for a BI architect role and has never heard of Inmon, you should worry.
13. Question: What is the difference between a data mart and a data warehouse? {L}
Answer: Most candidates will answer that one is big and the other is small. Some good candidates
(particularly Kimball practitioners) will say that a data mart is one star, whereas a DW is a collection of all
stars. An excellent candidate will give all the above answers, plus they will say that a DW could be the
normalised model that stores the EDW, whereas a DM is the dimensional model containing 1-4 stars for a specific
department (both relational DB and multidimensional DB).
Purpose: The question has 3 different levels of answer, so we can see how deep the candidate's
knowledge of data warehousing is.
14. Question: What is the purpose of having a multidimensional database?
Answer: Many candidates don't know what a multidimensional database (MDB) is. They have heard about
OLAP, but not MDB. So if the candidate looks puzzled, help them by saying an MDB is an OLAP
database. Many will say "Oh, I see" but actually they are still puzzled, so it will take a good few moments
before they are back down to earth again. So ask again: what is the purpose of having an OLAP database? The
answer is performance and easier data exploration. An MDB (aka cube) can be a hundred times faster than a
relational DB at returning an aggregate. An MDB is also very easy to navigate: drilling up and down the
hierarchies and across attributes, exploring the data.
Purpose: This question is irrelevant to a report or ETL developer, but a must for a cube developer and
DWA/DA. Every decent cube developer (SSAS, Hyperion, Cognos) should be able to answer the question
as it's their bread and butter.