Data Warehouse Concepts: TCS Internal

Here are some possible interview questions.
1. What is source qualifier?

2. Difference between DSS & OLTP?
3. Explain grouped cross tab?
4. Hierarchy of DWH?
5. How many repositories can we create in Informatica?
6. What is surrogate key?
7. What is difference between Mapplet and reusable transformation?
8. What is aggregate awareness?
9. Explain reference cursor?
10. What are parallel querys and query hints?
11. DWH architecture?
12. What are cursors?
13. Advantages of de normalized data?
14. What is operational data source (ODS)?
15. What is meta data and system catalog?
16. What is factless fact schema?
17. What is confirmed dimension?
18. What is the capacity of power cube?
19. Difference between PowerPlay transformer and power play reports?
20. What is IQD file?
21. What is Cognos script editor?
22. What is difference macros and prompts?
23. What is power play plug in?
24. Which kind of index is preferred in DWH?
25. What is hash partition?
26. What is DTM session?
27. How can you define a transformation? What are different types of transformations in
Informatica?
28. What is mapplet?
29. What is query panel?
30. What is a look up function? What is default transformation for the look up function?
31. What is difference between a connected look up and unconnected look up?
32. What is staging area?
33. What is data merging, data cleansing and sampling?
34. What is up date strategy and what are th options for update strategy?
35. OLAP architecture?
36. What is subject area?
37. Why do we use DSS database for OLAP tools?
Data Warehouse Concepts
What is a Data Warehouse? According to Inmon, famous author for several data warehouse
books, "A data warehouse is a subject oriented, integrated, time variant, non volatile collection
of data in support of management's decision making process".
Example: In order to store data, over the years, many application designers in each branch
have made their individual decisions as to how an application and database should be built. So
source systems will be different in naming conventions, variable measurements, encoding
structures, and physical attributes of data. Consider a bank that has got several branches in
several countries, has millions of customers and the lines of business of the enterprise are
savings, and loans. The following example explains how the data is integrated from source
systems to target systems.
TCS Internal
Example of Source Data

System Name
Attribute Name
Column Name
Datatype
Values
Source System
1
Customer Application
Date
CUSTOMER_APPLICATION_DATE NUMERIC(8,0) 11012005
Source System
2
Customer Application
Date
CUST_APPLICATION_DATE
DATE
11012005
Source System
3
Application Date
APPLICATION_DATE
DATE
01NOV2005
In the aforementioned example, attribute name, column name, datatype and values are entirely
different from one source system to another. This inconsistency in data can be avoided by
integrating the data into a data warehouse with good standards.
Example of Target Data(Data Warehouse)
Target System
Attribute Name
Column Name
Datatype
Values
Record #1
Customer Application Date CUSTOMER_APPLICATION_DATE DATE
01112005
Record #2
01112005
Record #3
01112005
In the above example of target data, attribute names, column names, and datatypes are
consistent throughout the target system. This is how data from various source systems is
integrated and accurately stored into the data warehouse.
See Figure 1.12 below for Data Warehouse Architecture Diagram.
Figure 1.12 : Data Warehouse Architecture
Data Warehouse & Data Mart

A data warehouse is a relational/multidimensional database that is designed for query and
TCS Internal
analysis rather than transaction processing. A data warehouse usually contains historical data
that is derived from transaction data. It separates analysis workload from transaction workload
and enables a business to consolidate data from several sources.
In addition to a relational/multidimensional database, a data warehouse environment often
consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that
manage the process of gathering data and delivering it to business users.
There are three types of data warehouses:
1. Enterprise Data Warehouse - An enterprise data warehouse provides a central database
for decision support throughout the enterprise.
2. ODS(Operational Data Store) - This has a broad enterprise wide scope, but unlike the real
entertprise data warehouse, data is refreshed in near real time and used for routine business
activity.
3. Data Mart - Datamart is a subset of data warehouse and it supports a particular region,
business unit or business function.
Data warehouses and data marts are built on dimensional data modeling where fact tables are
connected with dimension tables. This is most useful for users to access data since a database
can be visualized as a cube of several dimensions. A data warehouse provides an opportunity for
slicing and dicing that cube along each of its dimensions.
Data Mart: A data mart is a subset of data warehouse that is designed for a particular line of
business, such as sales, marketing, or finance. In a dependent data mart, data can be derived
from an enterprise-wide data warehouse. In an independent data mart, data can be collected
directly from sources.
Figure 1.12 : Data Warehouse and Datamarts
General Information
TCS Internal
In general, an organization is started to earn money by selling a product or by providing service

to the product. An organization may be at one place or may have several branches.
When we consider an example of an organization selling products throughtout the world, the
main four major dimensions are product, location, time and organization. Dimension tables have
been explained in detail under the section Dimensions. With this example, we will try to provide
detailed explanation about STAR SCHEMA.
What is Star Schema?
Star Schema is a relational database schema for representing multimensional data. It is the
simplest form of data warehouse schema that contains one or more dimensions and fact tables.
It is called a star schema because the entity-relationship diagram between dimensions and fact
tables resembles a star where one fact table is connected to multiple dimensions. The center of
the star schema consists of a large fact table and it points towards the dimension tables. The
advantage of star schema are slicing down, performance increase and easy understanding of
data.
Steps in designing Star Schema
Identify a business process for analysis(like sales).
Identify measures or facts (sales dollar).

Identify dimensions for facts(product dimension, location dimension, time dimension, organization
dimension).
List the columns that describe each dimension.(region name, branch name, region name).
Determine the lowest level of summary in a fact table(sales dollar).
Important aspects of Star Schema & Snow Flake Schema
In a star schema every dimension will have a primary key.

In a star schema, a dimension table will not have any parent table.
Whereas in a snow flake schema, a dimension table will have one or more parent tables.
Hierarchies for the dimensions are stored in the dimensional table itself in star schema.
Whereas hierachies are broken into separate tables in snow flake schema. These hierachies helps to
drill down the data from topmost hierachies to the lowermost hierarchies.
Glossary:
Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be
used to define data aggregation; for example, in a time dimension, a hierarchy might be used to
aggregate data from the Month level to the Quarter level, from the Quarter level to the Year
level. A hierarchy can also be used to define a navigational drill path, regardless of whether the
levels in the hierarchy represent aggregated totals or not.
Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that represents
data at the Month, Quarter, and Year levels.
TCS Internal
Fact Table
A table in a star schema that contains facts and connected to dimensions. A fact table typically
has two types of columns: those that contain facts and those that are foreign keys to dimension
tables. The primary key of a fact table is usually a composite key that is made up of all of its
foreign keys.
A fact table might contain either detail level facts or facts that have been aggregated (fact tables
that contain aggregated facts are often instead called summary tables). A fact table usually
contains facts with the same level of aggregation.
Example of Star Schema: Figure 1.6
In the example figure 1.6, sales fact table is connected to dimensions location, product, time
and organization. It shows that data can be sliced across all dimensions and again it is possible
for the data to be aggregated across multiple dimensions. "Sales Dollar" in sales fact table can
be calculated across all dimensions independently or in a combined manner which is explained
below.
Sales Dollar value for a particular product

Sales Dollar value for a product in a location
Sales Dollar value for a product in a year within a location
Sales Dollar value for a product in a year within a location sold or serviced by an employee
Snow Flake Schema

A snowflake schema is a term that describes a star schema structure normalized through the
use of outrigger tables. i.e dimension table hierachies are broken into simpler tables. In star
schema example we had 4 dimensions like location, product, time, organization and a fact
table(sales).
In snow flake schema, the example diagram shown below has 4 dimension tables, 4 lookup
tables and 1 fact table. The reason is that hierarchies(category, branch, state, and month) are
being broken out of the dimension tables(PRODUCT, ORGANIZATION, LOCATION, and TIME)
TCS Internal
respectively and shown separately. In OLAP, this snow flake schema approach increases the
number of joins and poor performance in retrieval of data. In few organizations, they try to
normalize the dimension tables to save space. Since dimension tables hold less space, snow
flake schema approach may be avoided.
Example of Snow Flake Schema: Figure 1.7
Fact Table
The centralized table in a star schema is called as FACT table. A fact table typically has two types
of columns: those that contain facts and those that are foreign keys to dimension tables. The
primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
In the example fig 1.6 "Sales Dollar" is a fact(measure) and it can be added across several
dimensions. Fact tables store different types of measures like additive, non additive and semi
additive measures.
Measure Types
Additive - Measures that can be added across all dimensions.

Non Additive - Measures that cannot be added across all dimensions.
Semi Additive - Measures that can be added across few dimensions and not with others.
A fact table might contain either detail level facts or facts that have been aggregated (fact tables
that contain aggregated facts are often instead called summary tables).
TCS Internal
In the real world, it is possible to have a fact table that contains no measures or facts. These
tables are called as Factless Fact tables.
Steps in designing Fact Table
Identify a business process for analysis(like sales).

Identify measures or facts (sales dollar).
Identify dimensions for facts(product dimension, location dimension, time dimension, organization
dimension).
List the columns that describe each dimension.(region name, branch name, region name).
Determine the lowest level of summary in a fact table(sales dollar).
Example of a Fact Table with an Additive Measure in Star Schema: Figure 1.6
In the example figure 1.6, sales fact table is connected to dimensions location, product, time
and organization. Measure "Sales Dollar" in sales fact table can be added across all dimensions
independently or in a combined manner which is explained below.
Sales Dollar value for a particular product

Sales Dollar value for a product in a location
Sales Dollar value for a product in a year within a location
Sales Dollar value for a product in a year within a location sold or serviced by an employee
What are ETL Tools?

ETL Tools are meant to extract, transform and load the data into Data Warehouse for decision
making. Before the evolution of ETL Tools, the above mentioned ETL process was done manually
by using SQL code created by programmers. This task was tedious and cumbersome in many
cases since it involved many resources, complex coding and more work hours. On top of it,
maintaining the code placed a great challenge among the programmers.
TCS Internal
These difficulties are eliminated by ETL Tools since they are very powerful and they offer many
advantages in all stages of ETL process starting from extraction, data cleansing, data profiling,
transformation, debuggging and loading into data warehouse when compared to the old method.
There are a number of ETL tools available in the market to do ETL process the data according to
business/technical requirements. Following are some those.
Popular ETL Tools
Tool Name
Company Name
Informatica
Informatica Corporation
DT/Studio
Embarcadero Technologies
DataStage
IBM
Ab Initio
Ab Initio Software Corporation
Data Junction
Pervasive Software
Oracle Warehouse Builder
Oracle Corporation
Microsoft SQL Server Integration Microsoft
ETL Concepts
Extraction, transformation, and loading. ETL refers to the methods involved in accessing and
manipulating source data and loading it into target database.
The first step in ETL process is mapping the data between source systems and target
database(data warehouse or data mart). The second step is cleansing of source data in staging
area. The third step is transforming cleansed source data and then loading into the target
system.
Note that ETT (extraction, transformation, transportation) and ETM (extraction, transformation,
move) are sometimes used instead of ETL.
Glossary of ETL (Reference:www.Oracle.com)
Source System
A database, application, file, or other storage facility from which the data in a data warehouse is
derived.
Mapping
The definition of the relationship and data flow between source and target objects.
Metadata
Data that describes data and other structures, such as objects, business rules, and processes.
For example, the schema design of a data warehouse is typically stored in a repository as
metadata, which is used to generate scripts used to build and populate the data warehouse. A
repository contains metadata.
Staging Area
A place where data is processed before entering the warehouse.
Cleansing
The process of resolving inconsistencies and fixing the anomalies in source data, typically as part
of the ETL process.
TCS Internal
Transformation
The process of manipulating data. Any manipulation beyond copying is a transformation.
Examples include cleansing, aggregating, and integrating data from multiple sources.
Transportation
The process of moving copied or transformed data from a source to a data warehouse.
Target System
A database, application, file, or other storage facility to which the "transformed source data" is
loaded in a data warehouse.
Figure 1.12 : Sample ETL Process Flow
ETL Tools - Informatica
TCS Internal
Informatica is a powerful ETL tool from Informatica Corporation, a leading provider of enterprise
data integration software and ETL softwares.
The important Informatica Components are:
Power Exchange
Power Center
Power Center Connect
Power Exchange
Power Channel
Metadata Exchange
Power Analyzer
Super Glue
In Informatica, all the Metadata information about source systems, target systems and
transformations are stored in the Informatica repository. Informatica's Power Center Client and
Repository Server access this repository to store and retrieve metadata.
Note: To know more about Metadata and its significance, please click here.
Source and Target:
Consider a Bank that has got many branches throughout the world. In each branch data may be
stored in different source systems like oracle, sql server, terradata, etc. When the Bank decides
to integrate its data from several sources for its management decisions, it may choose one or
more systems like oracle, sql server, terradata, etc. as its data warehouse target. Many
organisations prefer Informatica to do that ETL process, because Informatica is more powerful in
designing and building data warehouses. It can connect to several sources and targets to extract
meta data from sources and targets, transform and load the data into target systems.
Guidelines to work with Informatica Power Center
Repository: This is where all the metadata information is stored in the Informatica suite. The Power
Center Client and the Repository Server would access this repository to retrieve, store and manage
metadata.
Power Center Client: Informatica client is used for managing users, identifiying source and target
systems definitions, creating mapping and mapplets, creating sessions and run workflows etc.
Repository Server: This repository server takes care of all the connections between the repository
and the Power Center Client.
Power Center Server: Power Center server does the extraction from source and then loading data
into targets.
Designer: Source Analyzer, Mapping Designer and Warehouse Designer are tools reside within the
Designer wizard. Source Analyzer is used for extracting metadata from source systems.
Mapping Designer is used to create mapping between sources and targets. Mapping is a pictorial
representation about the flow of data from source to target.
Warehouse Designer is used for extracting metadata from target systems or metadata can be created
in the Designer itself.
TCS Internal
Data Cleansing: The PowerCenter's data cleansing technology improves data quality by validating,
correctly naming and standardization of address data. A person's address may not be same in all
source systems because of typos and postal code, city name may not match with address. These
errors can be corrected by using data cleansing process and standardized data can be loaded in target
systems (data warehouse).
Transformation: Transformations help to transform the source data according to the requirements of
target system. Sorting, Filtering, Aggregation, Joining are some of the examples of transformation.
Transformations ensure the quality of the data being loaded into target and this is done during the
mapping process from source to target.
Workflow Manager: Workflow helps to load the data from source to target in a sequential manner.
For example, if the fact tables are loaded before the lookup tables, then the target system will pop up
an error message since the fact table is violating the foreign key validation. To avoid this, workflows
can be created to ensure the correct flow of data from source to target.
Workflow Monitor: This monitor is helpful in monitoring and tracking the workflows created in each
Power Center Server.
Power Center Connect: This component helps to extract data and metadata from ERP systems like
IBM's MQSeries, Peoplesoft, SAP, Siebel etc. and other third party applications.
Power Center Exchange: This component helps to extract data and metadata from ERP systems like
IBM's MQSeries, Peoplesoft, SAP, Siebel etc. and other third party applications.
ETL Tools - Informatica

Power Exchange:
Informatica Power Exchange as a stand alone service or along with Power Center, helps
organizations leverage data by avoiding manual coding of data extraction programs. Power
Exchange supports batch, real time and changed data capture options in main frame(DB2,
VSAM, IMS etc.,), mid range (AS400 DB2 etc.,), and for relational databases (oracle, sql server,
db2 etc) and flat files in unix, linux and windows systems.
Power Channel:
This helps to transfer large amount of encrypted and compressed data over LAN, WAN, through
Firewalls, tranfer files over FTP, etc.
Meta Data Exchange:
Metadata Exchange enables organizations to take advantage of the time and effort already
invested in defining data structures within their IT environment when used with Power Center.
For example, an organization may be using data modeling tools, such as Erwin, Embarcadero,
Oracle designer, Sybase Power Designer etc for developing data models. Functional and technical
team should have spent much time and effort in creating the data model's data
structures(tables, columns, data types, procedures, functions, triggers etc). By using meta deta
exchange, these data structures can be imported into power center to identifiy source and target
mappings which leverages time and effort. There is no need for informatica developer to create
these data structures once again.
Power Analyzer:
Power Analyzer provides organizations with reporting facilities. PowerAnalyzer makes accessing,
TCS Internal
analyzing, and sharing enterprise data simple and easily available to decision makers.
PowerAnalyzer enables to gain insight into business processes and develop business intelligence.
With PowerAnalyzer, an organization can extract, filter, format, and analyze corporate
information from data stored in a data warehouse, data mart, operational data store, or
otherdata storage models. PowerAnalyzer is best with a dimensional data warehouse in a
relational database. It can also run reports on data in any table in a relational database that do
not conform to the dimensional model.
Super Glue:
Superglue is used for loading metadata in a centralized place from several sources. Reports can
be run against this superglue to analyze meta data.
Power Mart:
Power Mart is a departmental version of Informatica for building, deploying, and managing data
warehouses and data marts. Power center is used for corporate enterprise data warehouse and
power mart is used for departmental data warehouses like data marts. Power Center supports
global repositories and networked repositories and it can be connected to several sources. Power
Mart supports single repository and it can be connected to fewer sources when compared to
Power Center. Power Mart can extensibily grow to an enterprise implementation and it is easy for
developer productivity through a codeless environment.
Note:This is not a complete tutorial on Informatica. We will add more Tips and Guidelines on
Informatica in near future. Please visit us soon to check back. To know more about Informatica,
contact its official website www.informatica.com.
Informatica - Transformations
[Submitted by:Radhika, Michigan, US.]
In Informatica, Transformations help to transform the source data according to the requirements
of target system and it ensures the quality of the data being loaded into target.
Active Transformation
An active transformation can change the number of rows that pass through it from source to
target i.e it eliminates rows that do not meet the condition in transformation.
Passive Transformation
A passive transformation does not change the number of rows that pass through it i.e it passes
all rows through the transformation.
Transformations can be Connected or UnConnected.
Connected Transformation
Connected transformation is connected to other transformations or directly to target table in the
mapping.
UnConnected Transformation
An unconnected transformation is not connected to other transformations in the mapping. It is
called within another transformation, and returns a value to that transformation.
Following are the list of Transformations available in Informatica:
TCS Internal
Aggregator Transformation
Expression Transformation
Filter Transformation
Joiner Transformation
Lookup Transformation
Normalizer Transformation
Rank Transformation
Router Transformation
Sequence Generator Transformation
Stored Procedure Transformation
Sorter Transformation
Update Strategy Transformation
XML Source Qualifier Transformation
Advanced External Procedure Transformation
External Transformation
In the following pages, we will explain all the above Informatica Transformations and their
significances in the ETL process in detail.
Aggregator Transformation
Aggregator transformation is an Active and Connected transformation. This transformation is
useful to perform calculations such as averages and sums (mainly to perform calculations on
multiple rows or groups). For example, to calculate total of daily sales or to calculate average of
monthly or yearly sales. Aggregate functions such as AVG, FIRST, COUNT, PERCENTILE, MAX,
SUM etc. can be used in aggregate transformation.
Expression Transformation
Expression transformation is a Passive and Connected transformation. This can be used to
calculate values in a single row before writing to the target. For example, to calculate discount of
each product or to concatenate first and last names or to convert date to a string field.
Filter Transformation
Filter transformation is an Active and Connected transformation. This can be used to filter rows
in a mapping that do not meet the condition. For example, to know all the employees who are
working in Department 10 or to find out the products that falls between the rate category $500
and $1000.
Joiner Transformation
Joiner Transformation is an Active and Connected transformation. This can be used to join two
sources coming from two different locations or from same location. For example, to join a flat file
and a relational source or to join two flat files or to join a relational source and a XML source.
In order to join two sources, there must be atleast one matching port. at least one matching
port. While joining two sources it is a must to specify one source as master and the other as
detail.
The Joiner transformation supports the following types of joins:
TCS Internal
Normal
Master Outer
Detail Outer
Full Outer
Normal join discards all the rows of data from the master and detail source that do not match,
based on the condition.
Master outer join discards all the unmatched rows from the master source and keeps all the
rows from the detail source and the matching rows from the master source.
Detail outer join keeps all rows of data from the master source and the matching rows from
the detail source. It discards the unmatched rows from the detail source.
Full outer join keeps all rows of data from both the master and detail sources.
Lookup Transformation
Lookup transformation is Passive and it can be both Connected and UnConnected as well. It is
used to look up data in a relational table, view, or synonym. Lookup definition can be imported
either from source or from target tables.
For example, if we want to retrieve all the sales of a product with an ID 10 and assume that the
sales data resides in another table. Here instead of using the sales table as one more source,
use Lookup transformation to lookup the data for the product, with ID 10 in sales table.
Difference between Connected and UnConnected Lookup Transformation:
Connected lookup receives input values directly from mapping pipeline whereas UnConnected
lookup receives values from: LKP expression from another transformation.
Connected lookup returns multiple columns from the same row whereas UnConnected lookup
has one return port and returns one column from each row.
Connected lookup supports user-defined default values whereas UnConnected lookup does not
support user defined values.
Normalizer Transformation
Normalizer Transformation is an Active and Connected transformation. It is used mainly with
COBOL sources where most of the time data is stored in de-normalized format. Also, Normalizer
transformation can be used to create multiple rows from a single row of data.
Rank Transformation
Rank transformation is an Active and Connected transformation. It is used to select the top or
bottom rank of data. For example, to select top 10 Regions where the sales volume was very
high or to select 10 lowest priced products.
Router Transformation
Router is an Active and Connected transformation. It is similar to filter transformation. The only
difference is, filter transformation drops the data that do not meet the condition whereas router
has an option to capture the data that do not meet the condition. It is useful to test multiple
conditions. It has input, output and default groups. For example, if we want to filter data like
where State=Michigan, State=California, State=New York and all other States. Its easy to route
data to different tables.
TCS Internal
Sequence Generator Transformation

Sequence Generator transformation is a Passive and Connected transformation. It is used to
create unique primary key values or cycle through a sequential range of numbers or to replace
missing keys.
It has two output ports to connect transformations. By default it has two fields CURRVAL and
NEXTVAL(You cannot add ports to this transformation). NEXTVAL port generates a sequence of
numbers by connecting it to a transformation or target. CURRVAL is the NEXTVAL value plus one
or NEXTVAL plus the Increment By value.
Stored Procedure Transformation
Stored Procedure transformation is a Passive and Connected & UnConnected transformation. It is
useful to automate time-consuming tasks and it is also used in error handling, to drop and
recreate indexes and to determine the space in database, a specialized calculation etc.
The stored procedure must exist in the database before creating a Stored Procedure
transformation, and the stored procedure can exist in a source, target, or any database with a
valid connection to the Informatica Server. Stored Procedure is an executable script with SQL
statements and control statements, user-defined variables and conditional statements.
Sorter Transformation
Sorter transformation is a Connected and an Active transformation. It allows to sort data either
in ascending or descending order according to a specified field. Also used to configure for casesensitive sorting, and specify whether the output rows should be distinct.
Source Qualifier Transformation
Source Qualifier transformation is an Active and Connected transformation. When adding a
relational or a flat file source definition to a mapping, it is must to connect it to a Source
Qualifier transformation. The Source Qualifier performs the various tasks such as overriding
default SQL query, filtering records; join data from two or more tables etc.
Update Strategy Transformation
Update strategy transformation is an Active and Connected transformation. It is used to update
data in target table, either to maintain history of data or recent changes. You can specify how to
treat source rows in table, insert, update, delete or data driven.
XML Source Qualifier Transformation
XML Source Qualifier is a Passive and Connected transformation. XML Source Qualifier is used
only with an XML source definition. It represents the data elements that the Informatica Server
reads when it executes a session with XML sources.
Advanced External Procedure Transformation
Advanced External Procedure transformation is an Active and Connected transformation. It
operates in conjunction with procedures, which are created outside of the Designer interface to
extend PowerCenter/PowerMart functionality. It is useful in creating external transformation
applications, such as sorting and aggregation, which require all input rows to be processed
before emitting any output rows.
TCS Internal
External Procedure Transformation

External Procedure transformation is an Active and Connected/UnConnected transformations.
Sometimes, the standard transformations such as Expression transformation may not provide
the functionality that you want. In such cases External procedure is useful to develop complex
functions within a dynamic link library (DLL) or UNIX shared library, instead of creating the
necessary Expression transformations in a mapping.
Differences between Advanced External Procedure and External Procedure
Transformations:
External Procedure returns single value, where as Advanced External Procedure returns
multiple values.
External Procedure supports COM and Informatica procedures where as AEP supports only
Informatica Procedures.
Note:This is not a complete tutorial on Informatica. To know more about Informatica, contact its
official website www.informatica.com.
Database - RDBMS
There are a number of relational databases to store data. A relational database contains
normalized data stored in tables. Tables contain records and columns. RDBMS makes it easy to
work with individual records. Each row contains a unique instance of data for the categories
defined by the columns.
RDBMS are used in OLTP applications(e.g. ATM cards) very frequently and sometimes
datawarehouse may also use relational databases. Please refer to Relational data modeling for
details to know how data from a source system is normalized and stored in RDBMS databases.
Popular RDBMS Databases
RDBMS Name
Company Name
Oracle
Oracle Corporation
IBM DB2 UDB
IBM Corporation
IBM Informix
IBM Corporation
Microsoft SQL Server Microsoft

Sybase
Sybase Corporation
Terradata
NCR
Slowly Changing Dimensions

Dimensions that change over time are called Slowly Changing Dimensions. For instance, a
product price changes over time; People change their names for some reason; Country and
State names may change over time. These are a few examples of Slowly Changing Dimensions
since some changes are happening to them over a period of time.
Slowly Changing Dimensions are often categorized into three types namely Type1, Type2 and
Type3. The following section deals with how to capture and handling these changes over time.
The "Product" table mentioned below contains a product named, Product1 with Product ID being
the primary key. In the year 2004, the price of Product1 was $150 and over the time, Product1's
price changes from $150 to $350. With this information, let us explain the three types of Slowly
Changing Dimensions.
TCS Internal
Product Price in 2004:

Product ID(PK) Year Product Name Product Price
1
2004 Product1
$150
Type 1: Overwriting the old values.

In the year 2005, if the price of the product changes to $250, then the old values of the columns
"Year" and "Product Price" have to be updated and replaced with the new values. In this Type 1,
there is no way to find out the old value of the product "Product1" in year 2004 since the table
now contains only the new price and year information.
Product
1
2005 Product1
$250
Type 2: Creating an another additional record.

In this Type 2, the old values will not be replaced but a new row containing the new values will
be added to the product table. So at any point of time, the difference between the old values
and new values can be retrieved and easily be compared. This would be very useful for reporting
purposes.
Product
1
1
2004 Product1
2005 Product1
$150
$250
The problem with the above mentioned data structure is "Product ID" cannot store duplicate
values of "Product1" since "Product ID" is the primary key. Also, the current data structure
doesn't clearly specify the effective date and expiry date of Product1 like when the change to its
price happened. So, it would be better to change the current data structure to overcome the
above primary key violation.
Product
Product ID(PK)
1
1
Effective
Expiry
Year Product Name Product Price
DateTime(PK)
DateTime
01-01-2004 12.00PM 2004 Product1
$150
12-31-2004 11.59PM
01-01-2005 12.00PM 2005 Product1
$250
In the changed Product table's Data structure, "Product ID" and "Effective DateTime" are
composite primary keys. So there would be no violation of primary key constraint. Addition of
new columns, "Effective DateTime" and "Expiry DateTime" provides the information about the
product's effective date and expiry date which adds more clarity and enhances the scope of this
table. Type2 approach may need additional space in the data base, since for every changed
record, an additional row has to be stored. Since dimensions are not that big in the real world,
additional space is negligible.
Type 3: Creating new fields.
In this Type 3, the latest update to the changed values can be seen. Example mentioned below
illustrates how to add new columns and keep track of the changes. From that, we are able to see
the current price and the previous price of the product, Product1.
Product
TCS Internal
Current Product
Product ID(PK)
1
Year
2005
Current
Product Price
Name
Product1 $250
Old Product
Price
$150
Old Year
$2004
The problem with the Type 3 approach is over years, if the product price continuously changes,
then the complete history may not be stored, only the latest change will be stored. For example,
in year 2006, if the product1's price changes to $350, then we would not be able to see the
complete history of 2004 prices, since the old values would have been updated with 2005
product information.
Product
Product Product Old Product
Product ID(PK) Year
Name
Price
2006 Product1 $350
Price
$250
Old Year
$2005
Business Intelligence Tools

Business Intelligence Tools help to gather, store, access and analyze corporate data to aid in
decision-making. Generally these systems will illustrate business intelligence in the areas of
customer profiling, customer support, market research, market segmentation, product
profitability, statistical analysis, inventory and distribution analysis.
With Business Intelligence Tools, various data like customer related, product related, sales
related, time related, location related, employee related etc. are gathered and analysed based
on which important strategies or rules are formed and goals to achieve their target are set.
These decisions are very efficient and effective in promoting an Organisation's growth.
Since the collected data can be sliced across almost all the dimensions like time,location,
product, promotion etc., valuable statistics like sales profit in one region for the current year can
be calculated and compared with the previous year statistics.
Popular Business Intelligence Tools
Tool Name
Company Name
Business Objects
Business Objects
Cognos
Cognos
Hyperion
Hyperion
Microstrategy
Microstrategy
Microsoft Reporting Services Microsoft

Crystal
Business Objects
OLAP & its Hybrids

OLAP, an acronym for Online Analytical Processing is an approach that helps organization to
take advantages of DATA. Popular OLAP tools are Cognos, Business Objects, Micro Strategy etc.
OLAP cubes provide the insight into data and helps the topmost executives of an organization to
take decisions in an efficient manner.
Technically, OLAP cube allows one to analyze data across multiple dimensions by providing
multidimensional view of aggregated, grouped data. With OLAP reports, the major categories
like fiscal periods, sales region, products, employee, promotion related to the product can be
ANALYZED very efficiently, effectively and responsively. OLAP applications include sales and
TCS Internal
customer analysis, budgeting, marketing analysis, production analysis, profitability analysis and
forecasting etc.
ROLAP
ROLAP stands for Relational Online Analytical Process that provides multidimensional analysis of
data, stored in a Relational database(RDBMS).
MOLAP
MOLAP(Multidimensional OLAP), provides the analysis of data stored in a multi-dimensional data
cube.
HOLAP
HOLAP(Hybrid OLAP) a combination of both ROLAP and MOLAP can provide multidimensional
analysis simultaneously of data stored in a multidimensional database and in a relational
database(RDBMS).
DOLAP
DOLAP (Desktop OLAP or Database OLAP) provide multidimensional analysis locally in the client
machine on the data collected from relational or multidimensional database servers.
TCS Internal

Data Warehouse Concepts: TCS Internal

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Warehouse Concepts: TCS Internal

Uploaded by

Copyright:

Available Formats

Here are some possible interview questions.

1. What is source qualifier?

Example of Source Data

CUSTOMER_APPLICATION_DATE NUMERIC(8,0) 11012005

Customer Application Date CUSTOMER_APPLICATION_DATE DATE

Customer Application Date CUSTOMER_APPLICATION_DATE DATE

Customer Application Date CUSTOMER_APPLICATION_DATE DATE

Figure 1.12 : Data Warehouse Architecture

Data Warehouse & Data Mart

In general, an organization is started to earn money by selling a product or by providing service

Identify measures or facts (sales dollar).

Important aspects of Star Schema & Snow Flake Schema

In a star schema every dimension will have a primary key.

Sales Dollar value for a particular product

Snow Flake Schema

Additive - Measures that can be added across all dimensions.

Identify a business process for analysis(like sales).

Sales Dollar value for a particular product

What are ETL Tools?

Ab Initio Software Corporation

Oracle Warehouse Builder

Microsoft SQL Server Integration Microsoft

Figure 1.12 : Sample ETL Process Flow

ETL Tools - Informatica

ETL Tools - Informatica

Sequence Generator Transformation

External Procedure Transformation

IBM DB2 UDB

Microsoft SQL Server Microsoft

Slowly Changing Dimensions

Product Price in 2004:

Type 1: Overwriting the old values.

Type 2: Creating an another additional record.

Business Intelligence Tools

Microsoft Reporting Services Microsoft

OLAP & its Hybrids

You might also like