2) We use this data for query and analysis purposes; based on the analysis results we will make the business decisions.
Not only that but the business execs can query the data themselves with little or no support from
IT—saving more time and more money. That means the business users won’t have to wait until IT
gets around to generating the reports, and those hardworking folks in IT can do what they do
best—keep the business running.
Kimball approach:
1) As per the Ralph Kimball approach, first we define the data marts and then we define the data warehouse.
2) In this approach data marts are called independent data marts.

Inmon approach:
1) As per the W. H. Inmon approach, first we define the data warehouse and then we define the data marts.
2) In this approach data marts are called dependent data marts.

4. What are the characteristics of DWH?
1) Time variant: All the data in the data warehouse is identified with a particular time period. The data in a DWH must be historical.
2) Non-volatile: Data in a DWH is never overwritten or deleted once committed. The data is static and read-only; we can't change it. This data is used for future reporting.
3) Subject oriented: We collect the data from different subject areas. The collected data must be business oriented.
Ex: Sales, accounts, HR, etc.
4) Integrated: The DWH contains data gathered from across the organization and merged into a coherent whole.
Data Mart:
1) It is designed to support operational data monitoring.
3) A Data Mart is a subset of data from a Data Warehouse. Data Marts are built for specific user groups.
4) By providing decision makers with only a subset of data from the Data Warehouse, privacy, performance and clarity objectives can be attained.

Data Warehouse:
1) It is designed to support the decision-making process.
3) A Data Warehouse is simply an integrated consolidation of data from a variety of sources that is specially designed to support strategic and tactical decision making.
4) The main objective of a Data Warehouse is to provide an integrated environment and a coherent picture of the business at a point in time.
Facts: A fact is a business performance measurement, typically numeric and additive, stored in the fact table. Facts are also called KPIs (key performance indicators).
Ex: The number of products sold, the value of products sold, the products produced. The fact table also carries foreign keys such as Store Key and Customer Key.
1) Type 1: In Type 1 the new value simply overwrites the old value; no history is kept.

Customer Key  Name       State
1001          Christina  Illinois

After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:

Customer Key  Name       State
1001          Christina  California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is
no need to keep track of the old information.
Disadvantages:
All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.

Usage:
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
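The Type 1 overwrite described above can be sketched in a few lines of Python. This is a minimal illustration with an in-memory dimension table; the table layout and function name are illustrative assumptions, not from any specific ETL tool:

```python
# SCD Type 1: overwrite the attribute in place; no history is kept.
customer_dim = {
    1001: {"name": "Christina", "state": "Illinois"},
}

def scd_type1_update(dim, key, **changes):
    """Overwrite changed attributes; the old values are lost forever."""
    dim[key].update(changes)

scd_type1_update(customer_dim, 1001, state="California")
print(customer_dim[1001])  # {'name': 'Christina', 'state': 'California'}
```

After the update there is no trace of the Illinois value, which is exactly the disadvantage noted above.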
2) Type 2:
In Type 2 we maintain historical data. Whenever there is a change in address, we insert a new record with the new address and end-date the existing record.
To maintain versioning in SCD Type 2 we use a) start date and end date, b) a flag, or c) a version number (surrogate key).
Customer Key  Name       State
1001          Christina  Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the table:

Customer Key  Name       State
1001          Christina  Illinois
1005          Christina  California

(1005 is a new surrogate key generated for the new record.)
Advantages:
- This allows us to keep all historical information accurately.

Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
Usage:
Type 2 slowly changing dimension should be used when it is necessary for the data
warehouse to track historical changes.
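A minimal Python sketch of the Type 2 mechanism above, using start/end dates plus a current flag. The in-memory rows, column names, and surrogate-key counter are illustrative assumptions; a real warehouse would use a database sequence:

```python
from datetime import date

# SCD Type 2: end-date the current row, then insert a new row with a
# new surrogate key so the full history is preserved.
HIGH_DATE = date(9999, 12, 31)   # conventional "open-ended" end date
customer_dim = [
    {"sk": 1001, "name": "Christina", "state": "Illinois",
     "start_date": date(2000, 1, 1), "end_date": HIGH_DATE, "current": True},
]
_next_sk = 1002  # stand-in for a database sequence

def scd_type2_change(dim, name, new_state, change_date):
    global _next_sk
    for row in dim:
        if row["name"] == name and row["current"]:
            row["end_date"] = change_date   # end-date the existing record
            row["current"] = False
    dim.append({"sk": _next_sk, "name": name, "state": new_state,
                "start_date": change_date, "end_date": HIGH_DATE,
                "current": True})
    _next_sk += 1

scd_type2_change(customer_dim, "Christina", "California", date(2003, 1, 15))
# The dimension now holds two rows for Christina: the end-dated
# Illinois row and the current California row.
```

Queries for "state as of a given date" filter on start_date/end_date, which is why the table keeps growing, as the disadvantage above notes.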
3) Type 3: In SCD Type 3 we maintain the current value and the previous (original) value. This means one customer has a single record holding at most two values.
Customer Key  Name       State
1001          Christina  Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have the following
columns:
Customer Key
Name
Original State
Current State
Effective Date
After Christina moved from Illinois to California, the original information gets updated, and
we have the following table (assuming the effective date of change is January 15, 2003):
Customer Key  Name       Original State  Current State  Effective Date
1001          Christina  Illinois        California     15-JAN-2003
Advantages:
- This does not increase the size of the table, since the new information is updated in place.
Disadvantages:
Type 3 will not be able to keep all history where an attribute is changed more than once.
For example, if Christina later moves to Texas on December 15, 2003, the California
information will be lost.
Usage:
Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
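The Type 3 behaviour can be sketched in Python as well (again an in-memory illustration with assumed column names):

```python
# SCD Type 3: one row per customer, with separate columns for the
# original and the current value of the changing attribute.
customer_dim = {
    1001: {"name": "Christina", "original_state": "Illinois",
           "current_state": "Illinois", "effective_date": None},
}

def scd_type3_change(dim, key, new_state, effective_date):
    row = dim[key]
    row["current_state"] = new_state       # overwrite the current value
    row["effective_date"] = effective_date
    # original_state is never touched, so only one prior value survives.

scd_type3_change(customer_dim, 1001, "California", "15-JAN-2003")
# A later move to Texas would overwrite "California" in current_state:
# any history beyond the original value is lost.
```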
A dimension which is used by two or more fact tables is called a conformed dimension.
By nature it is a dimension, but it is located in the fact table. A degenerate dimension contains only a key and no attributes.
A junk dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension.
If a record occurs more than once in a table, differing only in a non-key attribute, then such a dimension is called a dirty dimension.
A "dirty dimension" is also one in which data quality cannot be guaranteed. For example, in
most banks, account-oriented source applications contain data about the same customer multiple
times. Many banks attempt to derive a "customer" by matching names and addresses across account
applications, but this process results in more than one entry for each bank customer. Similarly,
different attributes must be held for each of a bank's heterogeneous products. Attributes that are
meaningful for a loan, such as term, credit risk assessment, and collateral, have no meaning for
savings, checking, or investment products.
A dimension can play multiple roles in the fact table, e.g. order date, shipment date, invoice date, etc. Such a dimension is called a role-playing dimension.
In data warehousing we sometimes receive fact data first and the dimension data later; this kind of dimension is called a late-arriving dimension.
Refer Q8.
Refer Q8
Refer Q8.
1) Additive: These are facts on which we can perform arithmetic operations, such as addition, across all dimensions of the fact table.
Ex: If we want to find yearly sales, we can sum the quarterly sales.
2) Semi-additive: These are facts we can sum over some of the dimensions but not others.
Ex: In a bank customer account we can find the monthly balance using the credits and debits that happened in that particular month, but we cannot get a monthly balance by summing the daily balances.
3) Non-additive: These are facts that cannot be summed up across any dimension present in the fact table.
Ex: Temperature, ratios, etc.
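The three fact types above can be illustrated with small Python calculations (all values are invented for the example):

```python
# Additive: quarterly sales sum cleanly into a yearly total.
quarterly_sales = [100, 120, 90, 140]
yearly_sales = sum(quarterly_sales)          # 450

# Semi-additive: a balance sums across accounts (on the same day),
# but summing one account's daily balances over time is meaningless.
daily_balances = [500, 500, 700]             # one account, three days
balance_across_accounts = 500 + 300          # OK: two accounts, one day
not_a_real_balance = sum(daily_balances)     # 1700 -- meaningless total

# Non-additive: a ratio must be recomputed from its components,
# never summed (or naively averaged) directly.
profit = [10, 30]
revenue = [100, 100]
correct_margin = sum(profit) / sum(revenue)          # 0.2
wrong_margin = sum(p / r for p, r in zip(profit, revenue))  # summing ratios
```

The last line shows why ratio facts should be stored as their numerator and denominator components in the fact table, with the ratio computed at query time.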
A factless fact table captures a many-to-many relationship between dimensions but contains no measures of its own.
Ex: Identifying product promotion events (to determine promoted products that didn't sell).
1) Conceptual data model: At this level, the data modeller attempts to identify the highest-level relationships among the different entities. No attributes are specified.
2) Logical data model: At this level, the data modeller attempts to describe the data in as much detail as possible, without regard to how it will be physically implemented in the database. Foreign keys (keys identifying the relationships between different entities) are specified.
3) Physical data model: At this level, the data modeller specifies how the logical data model will be realized in the database schema. Physical considerations may cause the physical data model to be quite different from the logical data model.
Refer Q21.
Star schema:
1) The star schema is the simplest data warehouse schema.
2) Each dimension is represented in a single table; there should not be any hierarchies between dimension tables.
3) It contains a fact table surrounded by dimension tables. If the dimensions are de-normalized, we say it is a star schema design.
4) Only one join establishes the relationship between the fact table and any one of the dimension tables, so query performance is good.
5) It occupies more space compared to a snowflake schema because it is de-normalized.

Snowflake schema:
1) The snowflake schema is a more complex data warehouse model than a star schema.
2) At least one hierarchy exists between dimension tables.
3) It contains a fact table surrounded by dimension tables. If a dimension is normalized, we say it is a snowflaked design.
4) Since there are relationships between the dimension tables, many joins are needed to fetch the data, so query performance is poor.
5) It occupies less space compared to a star schema because it is normalized.

In a star schema there are fewer dimension tables, hence fewer joins, so query performance is good.
A snowflake design is normalized because it splits one dimension table into multiple dimension tables to model the hierarchy.
26. What is the difference between the full load and incremental load?
Full load: The first time we load the data warehouse, we load all the data from the source into the DWH; this is called a full load.
Incremental load: From the second load onwards, we capture only the changed data in the source and load that changed data into the DWH; this is called an incremental load.
28. What is surrogate key? Give an example where did you use it in your project?
A surrogate key is a system-generated sequence number used for maintaining uniqueness; it serves as the primary key of the dimension table.
In SCD Type 2, to maintain the version number we use a surrogate key.
If the target and source databases are different and the target table volume is high (it contains some millions of records), then without a staging table we need to design the Informatica mapping with a lookup to find out whether each record exists in the target table. Since the target has huge volumes, building the lookup cache is costly and it will hit performance.
If we create staging tables in the target database, we can simply do an outer join in the source qualifier to determine insert/update; this approach gives good performance.
We can create indexes on the staging tables so the source qualifier performs at its best.
If we have a staging area, there is no need to rely on Informatica transformations to know whether the record exists or not.
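The staging-table approach above amounts to joining staging to target on the business key and classifying each row. A rough Python sketch of that comparison (table contents and key names are invented for illustration):

```python
# Classify staged rows as INSERT or UPDATE by joining to the target
# on the business key -- the same decision the outer join in the
# source qualifier makes.
staging = {"C1": "California", "C2": "Texas"}   # business_key -> state
target  = {"C1": "Illinois",   "C3": "Ohio"}

# Key absent from target  -> brand-new row, INSERT.
inserts = {k: v for k, v in staging.items() if k not in target}

# Key present but attribute changed -> UPDATE (or SCD Type 2 insert).
updates = {k: v for k, v in staging.items()
           if k in target and target[k] != v}

print(inserts)  # {'C2': 'Texas'}
print(updates)  # {'C1': 'California'}
```

In SQL this is a single outer join with a CASE on whether the target key is null, which the database can do far more cheaply than a row-by-row lookup cache.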
Weeding out unnecessary or unwanted things (characters, spaces, etc.) from incoming data to make it more meaningful and informative.
31. What is the difference between business key and Surrogate key?
Business key: If I want to manage patient information using a business key, I will add patient_id and patient_code; these are different for each hospital.
Surrogate key: In SCD Type 2, to maintain the version number we use a surrogate key.
Aggregations need to account for the additive nature of the measures; they can be created on-the-fly or by pre-aggregation.
Common aggregations: Sum, Count, Distinct Count, Max, Min, Average, etc.
The granularity is the lowest level of information stored in the fact table; the depth of the data is known as its granularity. In a date dimension, the level of granularity could be year, quarter, month, period, week, or day.
View:
1) It does not store any data.
2) It is used for security purposes and projection.
3) We can perform DML operations on simple views.
4) A view fetches data from the base tables; it runs the base query each time.
5) There is no refresh clause for views.

Materialized view:
1) It stores summary or aggregated data.
2) It is used for reporting purposes.
3) We cannot perform DML operations on it.
4) It has fewer records compared to the base tables, so performance in reports is improved.
5) We can refresh materialized views with a refresh clause.
It is a normalized schema.
3, 4, or 5 TB.
B-Tree: Best suited for columns with high cardinality (many distinct values); the default index type, commonly used in OLTP systems.
Bit Map: Best suited for columns with low cardinality (few distinct values, e.g. gender or flag columns); commonly used in data warehouses.
My understanding is that the dimensions should be extracted first and the facts extracted next. That way the foreign keys will still be honoured in the staging area.
41. What happens if we try to load the fact tables before loading the dimension tables?
If we try to load the fact tables before loading the dimension tables, the fact data will be rejected, because the foreign key lookups against the dimension tables will fail.