
Data Warehouse

A data warehouse is a relational database designed for query and business analysis
rather than for transaction processing. It contains historical data derived from transaction data.
Business analysts use this historical data to understand the business in detail.
A data warehouse should have the following characteristics:
Subject oriented: The data gives information about a particular subject. For example, to learn
about a company's sales, the data warehouse is built on sales data; using it, we can find
last year's sales. This ability to define a data warehouse by subject (here, sales) makes it
subject oriented.
Integrated: Data is brought from different sources and put into a consistent format. This
includes resolving units of measure, naming conflicts, etc. For example, source A and source B
may have different ways of identifying a product, but in the data warehouse there will be only a
single way of identifying a product.
Non-volatile: Once data enters the data warehouse, it should not be updated. Historical data in
a data warehouse should never be altered.
Time variant: All data in the data warehouse is identified with a particular time period.
To analyze the business, analysts need large amounts of data, so the data warehouse should
contain historical data. For example, one can retrieve data from 3 months, 6 months, 12 months,
or even further back. This contrasts with a transaction system, where often only the most
recent data is kept. For example, a transaction system may hold only the most recent address of
a customer, whereas a data warehouse can hold all addresses associated with a customer.
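
The customer address example can be sketched as follows. This is only a minimal illustration: the table name, column names, and sample rows are hypothetical, and an in-memory SQLite database stands in for a real warehouse. Each address is stored with the period during which it was valid, so an old address is never overwritten.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical warehouse table: one row per address a customer has had,
# each tagged with the period during which it was valid.
conn.execute(
    """CREATE TABLE customer_address_history (
           customer_id INTEGER,
           address     TEXT,
           valid_from  TEXT,
           valid_to    TEXT
       )"""
)
# Dates are ISO strings, which compare correctly as text.
# A NULL valid_to marks the address that is still current.
conn.executemany(
    "INSERT INTO customer_address_history VALUES (?, ?, ?, ?)",
    [
        (1, "12 Oak St", "2021-01-01", "2022-06-30"),
        (1, "98 Elm Ave", "2022-07-01", None),
    ],
)


def address_on(customer_id, day):
    """Return the address that was valid for the customer on the given day."""
    row = conn.execute(
        """SELECT address FROM customer_address_history
           WHERE customer_id = ?
             AND valid_from <= ?
             AND (valid_to IS NULL OR valid_to >= ?)""",
        (customer_id, day, day),
    ).fetchone()
    return row[0] if row else None
```

With these sample rows, address_on(1, "2021-12-31") returns the old address and address_on(1, "2023-01-01") returns the current one. A transaction system that keeps only the latest address could answer the second question but not the first.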

The grain of a fact table defines the level of detail that is stored; the dimensions included
in the table make up this grain.

Three-Tier Data Warehouse Architecture

Generally, a data warehouse adopts a three-tier architecture. The three tiers are as follows.
Bottom Tier - The bottom tier of the architecture is the data warehouse database server, which
is a relational database system. Back-end tools and utilities feed data into the bottom tier;
they perform the Extract, Clean, Load, and Refresh functions.
Middle Tier - The middle tier holds the OLAP server, which can be implemented in either of
the following ways.
By Relational OLAP (ROLAP), an extended relational database management system. ROLAP maps
operations on multidimensional data to standard relational operations.
By Multidimensional OLAP (MOLAP), which directly implements multidimensional data and
operations.
Top Tier - This tier is the front-end client layer. It holds the query tools, reporting tools,
analysis tools, and data mining tools.
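
The Extract, Clean, Load, and Refresh functions that feed the bottom tier can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production ETL tool: the raw rows, the cleaning rules, and the orders table are all hypothetical, and an in-memory SQLite database stands in for the warehouse server.

```python
import sqlite3

# Hypothetical rows as they might arrive from a transaction system.
RAW_ORDERS = [
    {"order_id": "1", "amount": " 19.99", "country": "us"},
    {"order_id": "2", "amount": "5.00", "country": "US "},
    {"order_id": "2", "amount": "5.00", "country": "US "},  # duplicate row
    {"order_id": "3", "amount": "", "country": "de"},  # missing amount
]


def extract():
    """Extract: in practice this would read from the source systems."""
    return list(RAW_ORDERS)


def clean(rows):
    """Clean: drop incomplete rows and duplicates, normalize types and codes."""
    seen, cleaned = set(), []
    for row in rows:
        amount = row["amount"].strip()
        if not amount or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        cleaned.append(
            (int(row["order_id"]), float(amount), row["country"].strip().upper())
        )
    return cleaned


def load(conn, rows):
    """Load: write cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    # INSERT OR IGNORE lets a later refresh run skip rows already loaded.
    conn.executemany("INSERT OR IGNORE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, clean(extract()))
loaded = warehouse.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
```

Here loaded holds two rows: the duplicate and the row with a missing amount have been filtered out, and the country codes are in one consistent form. Running load again with the same rows models a refresh that adds nothing new.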
The following diagram depicts the three-tier architecture of data warehouse:
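
To make the ROLAP idea from the middle tier concrete, the sketch below shows how two multidimensional operations, a roll-up and a slice, can be expressed as ordinary relational queries (GROUP BY and WHERE). The sales table, its columns, and the sample rows are assumptions made for illustration; a real ROLAP server generates such SQL internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (year INTEGER, region TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (2023, "East", "Bread", 100.0),
        (2023, "East", "Cake", 50.0),
        (2023, "West", "Bread", 70.0),
        (2024, "East", "Bread", 120.0),
    ],
)

DIMENSIONS = {"year", "region", "product"}  # columns allowed in GROUP BY


def roll_up(*dims):
    """Roll-up: aggregate the amount measure over the chosen dimensions."""
    if not dims or not set(dims) <= DIMENSIONS:
        raise ValueError(f"dims must be a non-empty subset of {sorted(DIMENSIONS)}")
    cols = ", ".join(dims)
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()


def slice_region(region):
    """Slice: fix one value of the region dimension, then total by product."""
    return conn.execute(
        "SELECT product, SUM(amount) FROM sales "
        "WHERE region = ? GROUP BY product ORDER BY product",
        (region,),
    ).fetchall()
```

Calling roll_up("year", "region") gives totals per year and region, and roll_up("year") rolls those up further to one total per year. The dimension names are checked against a fixed set because they are interpolated into the SQL text.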

Data Warehouse Design Approaches

Data warehouse design is one of the key techniques in building a data warehouse. Choosing the
right design can save project time and cost. Two data warehouse design approaches are
popular.
Bottom-Up Design:
In the bottom-up design approach, the data marts are created first to provide reporting capability.
A data mart addresses a single business area such as sales or finance. These data marts are
then integrated to build a complete data warehouse. The integration is implemented using the
data warehouse bus architecture, in which a dimension is shared between facts in two or more
data marts. These shared dimensions are called conformed dimensions. The conformed dimensions
are integrated across the data marts, and the data warehouse is then built.

Advantages of bottom-up design are:

This model contains consistent data marts, and these data marts can be delivered quickly.
As the data marts are created first, reports can be generated quickly.
The data warehouse can be extended easily to accommodate new business units: it only
requires creating new data marts and integrating them with the existing ones.
Disadvantages of bottom-up design are:
The positions of the data warehouse and the data marts are reversed relative to the top-down
approach: the data marts come first, and the warehouse exists only once they have been integrated.
Top-Down Design:
In the top-down design approach, the data warehouse is built first. The data marts are then
created from the data warehouse.

Advantages of top-down design are:

Provides consistent dimensional views of data across data marts, as all data marts are
loaded from the data warehouse.
This approach is robust against business changes. Creating a new data mart from the
data warehouse is very easy.
Disadvantages of top-down design are:
This methodology is inflexible to changing departmental needs during implementation.
It represents a very large project, and the cost of implementing it is significant.

Data Warehouse Dimensional Modelling (Types of Schemas)

Four types of schemas are available in data warehousing. Of these, the star schema is the most
widely used in data warehouse designs, and the snowflake schema is the second most widely used.
These schemas are described below.
Star Schema: Good for Data Warehouse
A star schema is one in which a central fact table is surrounded by denormalized
dimension tables. A star schema can be simple or complex: a simple star schema consists of
one fact table, whereas a complex star schema has more than one fact table.
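
A simple star schema can be sketched with one fact table and two denormalized dimension tables. The table names, columns, and rows below are hypothetical and use an in-memory SQLite database. Note that the product type and the city are plain columns of their dimension tables, so the query needs only one JOIN per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Denormalized dimensions: product type and city are plain columns.
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        product_type TEXT
    );
    CREATE TABLE dim_store (
        store_key  INTEGER PRIMARY KEY,
        store_name TEXT,
        city       TEXT
    );
    -- Central fact table: a foreign key to each dimension plus the measure.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        store_key   INTEGER REFERENCES dim_store(store_key),
        amount      REAL
    );
    INSERT INTO dim_product VALUES
        (1, 'Rye loaf', 'Bread'), (2, 'Baguette', 'Bread'), (3, 'Muffin', 'Pastry');
    INSERT INTO dim_store VALUES (1, 'Main St', 'Austin'), (2, 'Harbor', 'Boston');
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 4.0), (1, 2, 6.0);
    """
)

# Total sales by product type and city: one JOIN per dimension.
star_rows = conn.execute(
    """
    SELECT p.product_type, s.city, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_store   s ON s.store_key   = f.store_key
    GROUP BY p.product_type, s.city
    ORDER BY p.product_type, s.city
    """
).fetchall()
```

With the sample rows, star_rows contains three groups: Bread in Austin (15.0), Bread in Boston (6.0), and Pastry in Boston (4.0).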

Snowflake Schema: Good for Data Marts

A snowflake schema extends the star schema by normalizing its dimensions, which adds further
dimension tables. Snowflake schemas are useful when the dimensions contain low-cardinality
attributes (attributes with few unique data values).

A snowflake schema query is more complex than a star schema query. Because the dimension
tables are normalized, we need to dig deeper to get, for example, the name of a product type
or a city: every additional level inside the same dimension requires another JOIN.
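
The extra JOINs can be seen in a small sketch. The schema below is hypothetical and uses an in-memory SQLite database: product type and city are moved out of the product and store dimensions into their own tables, so reaching their names takes two JOINs per dimension where a star schema would need one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized dimensions: product type and city live in their own tables.
    CREATE TABLE dim_product_type (type_key INTEGER PRIMARY KEY, type_name TEXT);
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        type_key     INTEGER REFERENCES dim_product_type(type_key)
    );
    CREATE TABLE dim_city (city_key INTEGER PRIMARY KEY, city_name TEXT);
    CREATE TABLE dim_store (
        store_key  INTEGER PRIMARY KEY,
        store_name TEXT,
        city_key   INTEGER REFERENCES dim_city(city_key)
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        store_key   INTEGER REFERENCES dim_store(store_key),
        amount      REAL
    );
    INSERT INTO dim_product_type VALUES (1, 'Bread'), (2, 'Pastry');
    INSERT INTO dim_product VALUES
        (1, 'Rye loaf', 1), (2, 'Baguette', 1), (3, 'Muffin', 2);
    INSERT INTO dim_city VALUES (1, 'Austin'), (2, 'Boston');
    INSERT INTO dim_store VALUES (1, 'Main St', 1), (2, 'Harbor', 2);
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 4.0), (1, 2, 6.0);
    """
)

# Total sales by product type and city: each hierarchy level adds a JOIN.
snowflake_rows = conn.execute(
    """
    SELECT t.type_name, c.city_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product      p ON p.product_key = f.product_key
    JOIN dim_product_type t ON t.type_key    = p.type_key
    JOIN dim_store        s ON s.store_key   = f.store_key
    JOIN dim_city         c ON c.city_key    = s.city_key
    GROUP BY t.type_name, c.city_name
    ORDER BY t.type_name, c.city_name
    """
).fetchall()
```

The trade-off is visible in the table definitions: each type name and city name is stored once, at the cost of four JOINs in the query.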

Galaxy Schema:
Galaxy schema contains many fact tables with some common dimensions (conformed
dimensions). This schema is a combination of many data marts.
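
As a rough sketch of this idea, the example below has two fact tables, one for sales and one for shipments, that share a single conformed product dimension. All table names, columns, and rows are hypothetical. Each fact table is aggregated on its own before the results are joined through the shared dimension, which avoids double counting when the two fact tables have different numbers of rows per product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One conformed dimension shared by both fact tables.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        units_sold  INTEGER
    );
    CREATE TABLE fact_shipments (
        product_key   INTEGER REFERENCES dim_product(product_key),
        units_shipped INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'Bread'), (2, 'Cake');
    INSERT INTO fact_sales VALUES (1, 3), (1, 4), (2, 1);
    INSERT INTO fact_shipments VALUES (1, 10), (2, 5), (2, 5);
    """
)

# Aggregate each fact table separately, then combine via the shared dimension.
drill_across = conn.execute(
    """
    SELECT p.product_name, s.sold, h.shipped
    FROM dim_product p
    JOIN (SELECT product_key, SUM(units_sold) AS sold
          FROM fact_sales GROUP BY product_key) s
      ON s.product_key = p.product_key
    JOIN (SELECT product_key, SUM(units_shipped) AS shipped
          FROM fact_shipments GROUP BY product_key) h
      ON h.product_key = p.product_key
    ORDER BY p.product_name
    """
).fetchall()
```

Because both fact tables use the same product keys, the sold and shipped totals line up per product without any translation between the two data marts.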

Fact Constellation Schema:

The dimensions in this schema are segregated into independent dimensions based on the levels
of hierarchy. For example, if geography has five levels of hierarchy, such as territory, region,
country, state, and city, a constellation schema would have five dimensions instead of one.
Note that many references use "fact constellation" as another name for the galaxy schema.

The fact table contains business facts (or measures), and foreign keys which refer to
candidate keys (normally primary keys) in the dimension tables. Contrary to fact
tables, dimension tables contain descriptive attributes (or fields) that are typically textual
fields (or discrete numbers that behave like text).
Dimension tables are used to describe dimensions; they contain dimension keys, values and
attributes. For example, the time dimension would contain every hour, day, week, month, quarter
and year that has occurred since you started your business operations. A product dimension could
contain the name and description of the products you sell, their unit price, color, weight and other
attributes as applicable. Dimension tables are typically small, ranging from a few to several
thousand rows. Occasionally dimensions can grow fairly large.
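
A time dimension of this kind is usually generated rather than typed in. The sketch below builds one row per day with its week, month, quarter, and year attributes; the column names and the YYYYMMDD integer key are assumptions for illustration, and an hourly level could be added in the same way.

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Return one dimension row per calendar day from start to end inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append(
            {
                "date_key": int(day.strftime("%Y%m%d")),  # e.g. 20240131
                "date": day.isoformat(),
                "iso_week": day.isocalendar()[1],
                "month": day.month,
                "quarter": (day.month - 1) // 3 + 1,
                "year": day.year,
            }
        )
        day += timedelta(days=1)
    return rows


# One row per day of 2024.
dim_date = build_date_dimension(date(2024, 1, 1), date(2024, 12, 31))
```

Even at daily detail, a full year is only a few hundred rows, which is consistent with dimension tables being small compared with fact tables.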
Although there might be other attributes that you store in the relational database, data
warehouses might not need all of those attributes. For example, customer telephone numbers,
email addresses and other contact information would not be necessary for the warehouse. Keep
in mind that data warehouses are used to make strategic decisions by analyzing trends; they are
not meant to be tools for daily business operations. On the other hand, you might have some
reports that do include data elements that aren't necessary for data analysis.
Fact tables contain keys to dimension tables as well as measurable facts that data analysts
would want to examine. For example, a store selling automotive parts might have a fact table
recording a sale of each item. The fact table of an educational entity could track credit hours
awarded to students. A bakery could have a fact table that records manufacturing of various
baked goods.
Fact tables can grow very large, with millions or even billions of rows. It is important to identify
the lowest level of facts that makes sense to analyze for your business; this is often referred to as
the fact table "grain". For instance, for a healthcare billing company it might be sufficient to track
revenues by month; daily and hourly data might not exist or might not be relevant.
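
The effect of choosing a grain can be illustrated with a small sketch. The revenue figures and the tuple layout are made up for the example. Facts stored at daily grain can always be summed up to monthly grain, but facts stored only at monthly grain cannot be broken back down into days.

```python
from collections import defaultdict

# Hypothetical daily-grain facts: (ISO date, revenue in cents).
DAILY_REVENUE = [
    ("2024-01-03", 12000),
    ("2024-01-17", 8000),
    ("2024-02-02", 5000),
]


def to_monthly_grain(daily_rows):
    """Sum daily rows into one row per month, keyed as YYYY-MM."""
    totals = defaultdict(int)
    for day, cents in daily_rows:
        totals[day[:7]] += cents  # first 7 characters are the year and month
    return sorted(totals.items())


monthly = to_monthly_grain(DAILY_REVENUE)
```

Here three daily rows collapse into two monthly rows. Storing only the monthly rows would make the fact table smaller, at the price of losing any analysis below the month level.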