You are on page 1of 12

1

DATA WARE HOSE FUNDAMENTALS:

1. What is Data warehouse?

1)It is a database which stores the historical data

2) We use this data for query & analysis purpose based on the analysis results we will make
the business decisions

3) We call DWH as decision supporting system

4) The data that is available in DWH is read only data

2. Why do we use DWH?

a)A Data Warehouse Delivers Enhanced Business Intelligence


By providing data from various sources, managers and executives will no longer need to make
business decisions based on limited data or their gut. In addition, “data warehouses and related
BI can be applied directly to business processes including marketing segmentation, inventory
management, financial management, and sales.”

b) A Data Warehouse Saves Time


Since business users can quickly access critical data from a number of sources—all in one place—
they can rapidly make informed decisions on key initiatives. They won’t waste precious time
retrieving data from multiple sources.

Not only that but the business execs can query the data themselves with little or no support from
IT—saving more time and more money. That means the business users won’t have to wait until IT
gets around to generating the reports, and those hardworking folks in IT can do what they do
best—keep the business running.

c) A Data Warehouse Enhances Data Quality and Consistency


A data warehouse implementation includes the conversion of  data from numerous source systems 
into a common format.  Since each data from the various departments is standardized, each
department will produce results that are in line with all the other departments. So you can have
more confidence in the accuracy of your data. And accurate data is the basis for strong business
decisions.

d) A Data Warehouse Provides Historical Intelligence


A data warehouse stores large amounts of historical data so you can analyze different time periods
and trends in order to make future predictions. Such data typically cannot be stored in a
transactional database or used to generate reports from a transactional system.

e) A Data Warehouse Generates a High ROI


2

ROI retun oninvestment


Finally, the piece de resistance—return on investment. Companies that have implemented data
warehouses and complementary BI systems have generated more revenue and saved more
money than companies that haven’t invested in BI systems and data warehouses.
3.What is the Difference between Ralph Kimball and WH Inmon Approach?
Ralph Kimball WH Inmon

1)As per ralph Kimball approach first we 1)As per WH Inmon approach first we 4. What
define the data marts and we should define data warehouse and we should are
define data warehouse. define data marts

2) In this approach data marts are called 2) In this approach data marts are called
independent data marts dependent data marts

3) This approach we called bottom up 3) this approach we called as top down


approach approach

Characteristics of DWH?

1)Time variant: All the data in data warehouse identified with particular time period. The data of
DWH must be historical

Ex: Day, Week, Month etc.

2) Non-volatile: Data in DWH are never over written or deleted once committed. The data is static. It
is used for read only data we can’t change. This data used for future reporting.

3) Subject oriented: In this we collect the data from different subject area. The collected data must
be business oriented.
Ex: Sales, accounts, HR etc.

4) Integrated: The DWH contains data gathered from all organizations and merged into a coherent.

5. What is the difference between ODS and DWH

ODS(Operational Data Store) DWH

1) It is design to support the operational data 1) It is design to support the decision making
monitoring. process.

2) Data is volatile 2) Data is non-volatile

3) It contains current data 3) It contains historical data

4) It contains detailed data 4) It contains summary data

5) It contains less volume of data 5) It contains high volume of data

6. What is the difference between Data Mart and DWH?


3

Data Mart Data Warehouse

1) Data mart is usually sponsored at the 1) Data warehouse is a “Subject-Oriented,


department level and developed with a specific Integrated, Time-Variant, Non-volatile collection
issue or subject in mind, a data mart is a data of data in support of decision making”.
warehouse with a focused objective.
2) A data mart is used on a business division/ 2) A data warehouse is used on an enterprise
department level. level

3) A Data Mart is a subset of data from a Data 3) A Data Warehouse is simply an integrated
Warehouse. Data Marts are built for specific user consolidation of data from a variety of sources
groups. that is specially designed to support strategic
and tactical decision making.
4) By providing decision makers with only a 4) The main objective of Data Warehouse is to
subset of data from the Data Warehouse, provide an integrated environment and coherent
Privacy, Performance and Clarity Objectives can
point in time.
be attained.

7. What are dimensions and Facts? Give the example?

Dimensions:It is structural or characteristic attribute. A dimension is essentially an entry point for


getting at the facts. Dimensions are things of interests to the business

Ex: Customer, Product, Store, Country, State etc.

Facts: A business performance measurement tropically numeric and additive. It is stored in the fact
table. Facts are also called KPI (key performance indicator)

Or

A fact is a numeric value that a business wishes to count or sum

Ex: Store key, Customer Key, Orders, The number of products sold, the value of products sold, the
products produced

8. What are different types of dimensions?

1) Slowly changed dimension: It will change with respect to time.


Ex: Customer address etc.
It is three types
1) Type 1: In case of type 1 we will not maintain any history. We will maintain only current
record by updating existing record. Technically type1 is insert else update.

In our example, recall we originally have the following table:

Customer
Name State
Key

1001 Christina Illinois


4

After Christina moved from Illinois to California, the new information replaces the new
record, and we have the following table:

Customer
Name State
Key

1001 Christina California

Advantages:

- This is the easiest way to handle the Slowly Changing Dimension problem, since there is
no need to keep track of the old information.

Disadvantages:

All history is lost. By applying this methodology, it is not possible to trace back in history.
For example, in this case, the company would not be able to know that Christina lived in
Illinois before. Usage:

About 50% of the time.

When to use Type 1:

Type 1 slowly changing dimension should be used when it is not necessary for the data
warehouse to keep track of historical changes.

2) Type2:
 In case of type2 we will maintain historical data.
 Whenever there is a change in address will insert new address in address table and
end dated the existing record.
 To maintain the versioning in SCD type2 we use a) start date, end date b) flag c)
Address key or version number

In our example, recall we originally have the following table:

Customer
Name State
Key

1001 Christina Illinois

After Christina moved from Illinois to California, we add the new information as a new row
into the table:

Customer
Name State
Key

1001 Christina Illinois

1005 Christina California

Advantages:

- This allows us to accurately keep all historical information.

Disadvantages:
5

- This will cause the size of the table to grow fast. In cases where the number of rows for
the table is very high to start with, storage and performance can become a concern.

- This necessarily complicates the ETL process.

Usage:

About 50% of the time.

When to use Type 2:

Type 2 slowly changing dimension should be used when it is necessary for the data
warehouse to track historical changes.

3) Type3: In case of scd type3 we will maintain the current record and previous record. It
means one customer can have maximum two records.

In our example, recall we originally have the following table:

Customer
Name State
Key

1001 Christina Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following
columns:

 Customer Key

 Name

 Original State

 Current State

 Effective Date

After Christina moved from Illinois to California, the original information gets updated, and
we have the following table (assuming the effective date of change is January 15, 2003):

Customer Original Current Effective


Name
Key State State Date

15-JAN-
1001 Christina Illinois California
2003

Advantages:

- This does not increase the size of the table, since new information is updated.

- This allows us to keep some part of history.

Disadvantages:
6

Type 3 will not be able to keep all history where an attribute is changed more than once.
For example, if Christina later moves to Texas on December 15, 2003, the California
information will be lost.

Usage:

Type 3 is rarely used in actual practice.

When to use Type 3:

Type III slowly changing dimension should only be used when it is necessary for the data
warehouse to track historical changes, and when such changes will only occur for a finite
number of time.

9. What is Confirmed dimension? Give example?

A Dimension which is used two or more fact tables is called conformed dimension.

Ex: Time, Product, Country etc.

10. What is Degenerated Dimension? Give example?

By nature it is a dimension but it will be located in fact table. Degenerated dimension contains only a
key and no attributes.

Ex: Bill number, Receipt number etc.

11. What is Junk Dimension? Give example?

A junk dimension is collection of random transactional codes, flags and/or text attributes that are
unrelated to any particular dimension.

Ex: Yes or No, Male or Female etc.

12. What is Dirty Dimension? Give example?

If a record occurs more than one time in a table by the difference of a non-key attribute then such
dimension is called dirty dimension.

For example A "dirty dimension" is one in which data quality cannot be guaranteed. For example, in
most banks, account-oriented source applications contain data about the same customer multiple
times. Many banks attempt to derive a "customer" by matching names and addresses across account
applications, but this process results in more than one entry for each bank customer. Similarly,
different attributes must be held for each of a bank's heterogeneous products. Attributes that are
meaningful for a loan, such as term, credit risk assessment, and collateral, have no meaning for
savings, checking, or investment products.

13. What is Role play dimension?

Dimension will pay multiple roles in the fact table like orders, sales, shipment date, Invoice date etc.

Ex: Time dimension etc.

14. What is lately arrived Dimension?

In data warehousing some-times we receive fact data first and later we receive dimension data this
kind of dimension is called lately arrived dimension.
7

15. What is rapidly changing dimension?

It is dimension that changes rapidly with respect to time period.

Ex: share price etc.

16. What are different types of slowly changing dimensions?

Refer Q8.

17. What is the difference between type1, type2 and type3?

Refer Q8

18. How do we maintain the versioning in type2 dimension?

Refer Q8.

19. What are different types of Facts? Give Examples?

These are three types

1) Additive : These are facts on which we can perform arthematic operation like addition on the
fact table
Ex: If we want find yearly sales we can sum quarter sales
2) Semi additive : These are fact we can perform sum operation for some of the dimension not to
others
Ex: In bank customer account we can find out monthly balance using credits and debits happen
in the particular month. We can’t find monthly balances using daily transactions.
3) Non-additive: These are facts we can’t be summed up for any dimension present in the fact
table
Ex: Temperature, Ratio’s etc.

20. What is fact less fact table? Give the example?

Fact less fact table captures many to many relation between dimensions but contains with-out any
measures

Common examples of fact less fact tables include:

 Identifying product promotion events (to determine promoted products that didn’t sell)

 Tracking student attendance or registration events

 Tracking insurance-related accident events

21. What are steps involved in dimensional modelling?

Dimensional modelling involved in three steps:

Conceptual Data Model

Features of conceptual data model include:

 Includes the important entities and the relationships among them.


8

 No attribute is specified.

 No primary key is specified.

At this level, the data modeller attempts to identify the highest-level relationships among the
different entities.

Logical Data Model

Features of logical data model include:

 Includes all entities and relationships among them.

 All attributes for each entity are specified.

 The primary key for each entity specified.

 Foreign keys (keys identifying the relationship between different entities) are specified.

 Normalization occurs at this level.

At this level, the data modeller attempts to describe the data in as much detail as possible,
without regard to how they will be physically implemented in the database.

Physical Data Model

Features of physical data model include:

 Specification all tables and columns.

 Foreign keys are used to identify relationships between tables.

 Demoralization may occur based on user requirements.

 komPhysical considerations may cause the physical data model to be quite different from
the logical data model.

this level, the data modeller will specify how the logical data model will be realized in the database
schema.

22. What is the difference between logical and physical modelling?

Refer Q21.

23. What is difference between start schema and snowflake?

Star Schema Snow Flake Schema

The star schema is the simplest data warehouse Snowflake schema is a more complex data
scheme. warehouse model than a star schema.
9

In star schema each of the dimensions is In snow flake schema at least one hierarchy
represented in a single table .It should not have should exists between dimension tables.
any hierarc.lhies between dims.

It contains a fact table surrounded by dimension It contains a fact table surrounded by dimension
tables. If the dimensions are de-normalized, we tables. If a dimension is normalized, we say it is a
say it is a star schema design. snow flaked design.
In star schema only one join establishes the In snow flake schema since there is relationship
relationship between the fact table and any one between the dimensions tables it has to do many
of the dimension tables. So query performance is joins to fetch the data. So query performance is
good poor

A star schema optimizes the performance by Snowflake schemas normalize dimensions to


keeping queries simple and providing fast eliminated redundancy. The result is more
response time. All the information about the complex queries and reduced query
each level is stored in one row. performance.
It is called a star schema because the diagram It is called a snowflake schema because the
resembles a star. diagram resembles a snowflake.

It occupies more space compare snowflake Snowflake schema occupies less space when
schema because it is DE normalized compare to star schema because it is normalized

24. Why query performance is better in Start Schema?

In star schema there will be less number of dimension tables so less number of joins so query
performance is good.

25. Which design Start Schema or snow flake is normalized one?

Snow flake design is normalized because snow flake design split one dimension table into multiple
dimensions to achieve the dimension.

26. What is the difference between the full load and incremental load?

Full load Incremental load

First time in the data warehouse we load all Capture the changed data to load DWH we
the data from source into DWH is called full use incremental load
load

Full load we prefer load data first time Second time onwards we perform
incremental load in DWH

By performing the incremental load we are


trying to load less number of data to
improve performance

27. How can we do the incremental load from source?

Capture the changed data in source then we load changed data to DWH.
10

28. What is surrogate key? Give an example where did you use it in your project?

It is the system generated sequence number used for maintaining the unique ness or primary key

In case of SCD Type2 to maintain the versioning number we use surrogate key .

29. What is Staging area? Why do we use staging area?

If target and0 source databases are different and target table volume is high it contains some
millions of records in this scenario without staging table we need to design your Informatica using
look up to find out whether the record exists or not in the target table since target has huge volumes
so its costly to create cache it will hit the performance.

If we create staging tables in the target database we can simply do outer join in the source qualifier
to determine insert/update this approach will give you good performance.

It will avoid full table scan to determine insert/updates on target.


And also we can create index on staging tables since these tables were designed for specific
application it will not impact to any other schemas/users.

While processing flat files to data warehousing we can perform cleansing.


Data cleansing, also known as data scrubbing, is the process of ensuring that a set of data is correct
and accurate. During data cleansing, records are checked for accuracy and consistency.

 Since it is one-to-one mapping from ODS to staging we do truncate and reload.

 We can create indexes in the staging state, to perform our source qualifier best.

 If we have the staging area no need to relay on the informatics transformation to known
whether the record exists or not.

30. What is data cleaning?

Weeding out unnecessary or unwanted things (characters and spaces etc.) from incoming data to
make it more meaningful and informative

31. What is the difference between business key and Surrogate key?

Business Key Surrogate key

A business key or natural key is an index It is the system generated sequence


which identifies unique ness of a row number used for maintaining the unique
based on column that exists naturally in a ness or primary key
table according to business rules

Business are a good way of avoiding It is alternative of business key. Surrogate


duplicate records so in practice I typically key are most efficient then business key so I
add business key as an alternative key in a would always recommended joining a fact
dimension table to a dimension table using surrogate
key

I want to manage the patient information In case of SCD Type2 to maintain the
using business key will add the patient_id
11

and patient_code this is different for each versioning number we use surrogate key.
hospital

32. What is normalization? Why do we normalize the data?

1) Spiting of one table into multiple table is called normalization.

2) It is used to eliminate the data redundancy.

3) It also used to eliminate the duplicates

33. What are aggregate tables?

Aggregation is a summarization of base-level fact table records

Aggregations need to account for the additive nature of the measures, created on-the-fly or by
pre-aggregation

Common aggregation scenarios

a. Category product aggs by store by day

b. District store aggs by product by day

c. Monthly sales aggs by product by store

d. Category product aggs by store district by day

Common aggregations = Sum, Count, Distinct Count, Max, Min, Average, etc.

34. What is granularity?

The granularity is the lowest level of information stored in the fact table. The depth of data
level is known as granularity. In date dimension the level could be year, month, quarter,
period, week, day of granularity.

The process consists of the following two steps:


- Determining the dimensions that are to be included
- Determining the location to place the hierarchy of each dimension of information

35. What is the difference between view and materialized view?

View Materialized View

View is a logical object It is physical object

It will not store any data It will store summary or aggregated data

It is used for security purpose and projection It is used for reporting purpose
12

We can perform DML operation on simple In this we can’t perform DML operations
views

View will fetch the data from base table it It will have less number of records compare
will run the base query to base tables performance will be improved
in the reports

There is no refresh clause in the views We can refresh the materialized views with
refresh clause

If we create view on single table we can call


as simple view and we create view on more
tables it is called as complex view

36. Explain a situation where snowflake is better than start schema?

Snowflake is occupies less amount of space when compare to star schema.

It is Normalized schema

37. What are different architectures of data warehouse?

38. What is the size of your DWH?

3 or 4 or 5 TB’s

39. What type of indexes normally used in DWH?

In data warehouse normally we use B-tree and Bit Map indexes

B-Tree:

1) It will create on the columns with highest cardinality columns

2) It is default index created by oracle database.

Bit Map:

1) It will create on lowest cardinality columns.

40. What is loading order of dimension and facts?

My understanding says that the dimensions should be extract first and fact should be extracted. That
way the foreign keys still be honoured in the staging area.

41. What happens if we try to load the fact tables before loading the dimension tables?

If we try to load the fact tables before loading the dimension table’s data will be rejected.

42. What is a Schema?

Graphical Representation of the data structure is called as schema.

You might also like