
DATA WAREHOUSING

The goal is to enable users to make informed decisions rapidly so their companies can
respond to change and remain competitive.

DATA WAREHOUSE GROWTH

1. Starter
2. How to Create a Logical Design?
3. Physical Design in Data Warehouses
4. Physical Design Structures
   4.1 Tablespaces
   4.2 Tables and Partitioned Tables
   4.3 Views
   4.4 Integrity Constraints
   4.5 Indexes and Partitioned Indexes
   4.6 Dimensions
5. Deployment
   5.1 Stage 1: Reporting
   5.2 Stage 2: Analysis
   5.3 Stage 3: Prediction
   5.4 Stage 4: Operationalize
   5.5 Stage 5: Activate
6. Managing Data Warehouses



1. Starter

When a corporation decides to build a data warehouse, it has to define the business
requirements and the scope of the application and create a conceptual design. The next
task is to translate these requirements into a system deliverable. To do this, one creates
the logical and physical design for the data warehouse and defines:

o The specific data content
o Relationships within and between groups of data
o The system environment supporting your data warehouse
o The data transformations required
o The frequency with which data is refreshed

The logical design is conceptual and abstract and defines the logical relationships
among the objects. The physical design, on the other hand, describes the most effective
way of storing and retrieving the objects, as well as handling them from a transportation
and backup/recovery perspective.

2. How to Create a Logical Design?

A logical design is conceptual and abstract; we begin by defining the types of
information that we need.

The entity-relationship modeling technique can be used to model the corporation's
logical information requirements. Entity-relationship modeling requires identifying the
things of importance, called entities; the properties of these things, called attributes;
and how they are related to one another, called relationships.

The process of logical design involves arranging data into a series of logical relationships
called entities and attributes. An entity represents a chunk of information. In relational
databases, an entity often maps to a table. An attribute is a component of an entity that
helps define the uniqueness of the entity. In relational databases, an attribute maps to a
column.

For a consistent database, we need unique identifiers. A unique identifier is added to
tables so that one can differentiate between occurrences of the same item when it
appears in different places. In a physical design, this is usually a primary key.

Entity-relationship diagramming is associated with normalized models, such as those
found in OLTP applications. For data warehouse design, this technique is adapted in the
form of dimensional modeling.

In dimensional modeling, instead of seeking to discover atomic units of information
(such as entities and attributes) and all of the relationships between them, one has to
identify which information belongs to a central fact table and which information belongs
to its associated dimension tables.


Thus the logical design can be made using pen and paper, and it gives:

(1) a set of entities and attributes corresponding to fact tables and dimension tables
(2) a model of how operational data from the source is transformed into subject-oriented
information in the target data warehouse schema.

3. Physical Design in Data Warehouses

Now we will look at the physical design of a data warehouse environment, moving from
the logical design to the physical design.

We have seen that the logical design can be drawn with pen and paper before building
the data warehouse. The physical design is the creation of the database with SQL
statements.

During the physical design process, you convert the data gathered during the logical
design phase into a description of the physical database structure. Physical design
decisions are mainly driven by query performance and database maintenance aspects.

During the logical design phase, a model of the data warehouse consists of entities,
attributes, and relationships. The entities are linked together using relationships.
Attributes are used to describe the entities. The unique identifier (UID) distinguishes
between one instance of an entity and another.

The following figure shows the different ways of thinking about logical and physical
designs.

[Figure: Logical design comprises Entities, Relationships, Attributes, and Unique
Identifiers. Physical design comprises Tables, Indexes, Integrity Constraints (Primary
Key, Foreign Key, NOT NULL), Materialized Views, Columns, and Dimensions.]

During the physical design process, the expected schema is translated into actual
database structures by the following mapping (a short SQL sketch follows the list):

o Entities to tables
o Relationships to foreign key constraints
o Attributes to columns
o Primary unique identifiers to primary key constraints
o Unique identifiers to unique key constraints
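
As an illustration, consider a minimal sketch in Oracle-style SQL. The product and sales
tables, their columns, and the constraint names are hypothetical, introduced here only
to show the mapping:

CREATE TABLE product (
  product_id    NUMBER        CONSTRAINT product_pk PRIMARY KEY,  -- primary UID -> primary key
  product_name  VARCHAR2(50)  NOT NULL,                           -- attribute -> column
  category      VARCHAR2(30)
);

CREATE TABLE sales (
  sale_id     NUMBER  CONSTRAINT sales_pk PRIMARY KEY,            -- entity -> its own table and key
  product_id  NUMBER  CONSTRAINT sales_product_fk
                      REFERENCES product (product_id),            -- relationship -> foreign key
  sale_date   DATE    NOT NULL,
  amount      NUMBER(10,2)
);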

4. Physical Design Structures

After converting the logical design to a physical one, we need to create the following structures:

(i) Tablespaces
(ii) Tables and Partitioned Tables
(iii) Views
(iv) Integrity Constraints
(v) Dimensions

Some of these structures require disk space; others exist only in the data dictionary. For
performance improvement, the following structures may also be created (a materialized
view sketch follows the list):

(i) Indexes and Partitioned Indexes
(ii) Materialized Views
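
For example, a materialized view that precomputes a common aggregate might be
sketched as follows. The sales table and column names are hypothetical; ENABLE QUERY
REWRITE is the Oracle clause that lets the optimizer transparently substitute the stored
summary for matching queries:

CREATE MATERIALIZED VIEW sales_by_product_mv
  BUILD IMMEDIATE             -- populate the summary immediately
  REFRESH COMPLETE ON DEMAND  -- rebuild only when explicitly requested
  ENABLE QUERY REWRITE        -- let the optimizer use it transparently
AS
SELECT   product_id,
         SUM(amount) AS total_amount,
         COUNT(*)    AS sale_count
FROM     sales
GROUP BY product_id;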

4.1 Tablespaces

A tablespace consists of one or more data files, which are physical structures within the
operating system. A datafile is associated with only one tablespace. From a design
perspective, tablespaces are containers for physical design structures.

Tablespaces should be separated according to their differences in usage: for example,
tables should be separated from their indexes, and small tables should be separated
from large tables. During parallel processing or concurrent query access, all tablespaces
accessed by parallel operations should be striped. Striping means that data from a large
table is divided into small portions that are stored in separate datafiles on separate
disks. Tablespaces should always be striped over at least as many devices as there are
CPUs.
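
As a sketch of such a layout in Oracle-style syntax (the tablespace name, file paths, and
sizes are hypothetical), a tablespace striped across datafiles on four different disks might
be created as:

CREATE TABLESPACE ts_sales_data
  DATAFILE '/disk1/oradata/ts_sales_01.dbf' SIZE 500M,  -- one datafile per disk
           '/disk2/oradata/ts_sales_02.dbf' SIZE 500M,
           '/disk3/oradata/ts_sales_03.dbf' SIZE 500M,
           '/disk4/oradata/ts_sales_04.dbf' SIZE 500M;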

The following figure shows a configuration with four CPUs, two controllers, and three
tablespaces.

[Figure: tablespaces 1, 2, and 3 are each striped in four pieces across the devices
attached to the two controllers, so that all four CPUs can be used in parallel.]


4.2 Tables and Partitioned Tables

Tables are the basic unit of data storage. They are the container for the expected
amount of raw data in your data warehouse.

Using partitioned tables, a large data volume can be decomposed into smaller and more
manageable pieces. The main design criterion for partitioning is manageability, though
it also brings performance benefits.

For example, we can choose a partitioning strategy based on a sales transaction date
and a monthly granularity. If we have four years' worth of data, then we can delete a
month's data as it becomes older than four years with a single, quick DDL statement
and load new data while only affecting 1/48th of the complete table. Business questions
regarding the last quarter will only affect three months, which is equivalent to three
partitions, or 3/48ths of the total volume.
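
A minimal sketch of this strategy in Oracle-style DDL (the sales table, its columns, and
the partition names are hypothetical) could look like:

CREATE TABLE sales (
  sale_id     NUMBER,
  product_id  NUMBER,
  channel_id  NUMBER,
  sale_date   DATE NOT NULL,
  amount      NUMBER(10,2)
)
PARTITION BY RANGE (sale_date) (
  PARTITION sales_2021_01 VALUES LESS THAN (TO_DATE('2021-02-01', 'YYYY-MM-DD')),
  PARTITION sales_2021_02 VALUES LESS THAN (TO_DATE('2021-03-01', 'YYYY-MM-DD'))
  -- ... one partition per month, 48 in all for four years of data
);

-- Rolling off the oldest month is then a single, quick DDL statement:
ALTER TABLE sales DROP PARTITION sales_2021_01;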

Partitioning large tables improves performance because each partitioned piece is more
manageable. One can partition based on transaction dates in a data warehouse. For
example, each month, one month's worth of data can be assigned its own partition.

4.3 Views

A view is a presentation of the data contained in one or more tables or other views. A
view takes the output of a query and treats it as a table. Views do not require any space
in the database.
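
For instance (a sketch against the hypothetical sales table used earlier), a view that
presents only recent transactions stores no data of its own:

CREATE VIEW recent_sales AS
  SELECT sale_id, product_id, sale_date, amount
  FROM   sales
  WHERE  sale_date >= ADD_MONTHS(SYSDATE, -3);  -- last three months only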

4.4 Integrity Constraints

Integrity constraints are used to enforce business rules associated with your database
and to prevent invalid information in the tables. Integrity constraints in data
warehousing differ from constraints in OLTP environments. In OLTP environments, they
primarily prevent the insertion of invalid data into a record; this is not a big problem in
data warehousing environments because accuracy has already been guaranteed before
loading. In data warehousing environments, constraints are only used for query rewrite.
NOT NULL constraints are particularly common in data warehouses. Under some specific
circumstances, constraints need space in the database, in the form of an underlying
unique index.
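
In Oracle, for example, a constraint can be declared without being enforced, so that it
adds no load-time cost yet still informs query rewrite. A sketch, assuming the
hypothetical sales and product tables from earlier and that the foreign key has not yet
been declared:

ALTER TABLE sales ADD CONSTRAINT sales_product_fk
  FOREIGN KEY (product_id) REFERENCES product (product_id)
  RELY DISABLE NOVALIDATE;  -- neither enforced nor validated, but trusted for rewrite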

4.5 Indexes and Partitioned Indexes

Indexes are optional structures associated with tables or clusters. In addition to the
classical B-tree indexes, bitmap indexes are very common in data warehousing
environments.

Bitmap indexes are optimized index structures for set-oriented operations. Additionally,
they are necessary for some optimized data access methods such as star
transformations.


Indexes are just like tables in that we can partition them, although the partitioning
strategy is not dependent upon the table structure. Partitioning indexes makes it easier
to manage the warehouse during refresh and improves query performance.
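
As a sketch (again using the hypothetical partitioned sales table), a bitmap index on a
low-cardinality column, partitioned along with its table:

CREATE BITMAP INDEX sales_channel_bix
  ON sales (channel_id)  -- few distinct values: a good bitmap candidate
  LOCAL;                 -- one index partition per table partition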

4.6 Dimensions

A dimension is a schema object that defines hierarchical relationships between attributes
or attribute sets. A hierarchical relationship is a functional dependency from one level of
a hierarchy to the next one. A dimension is a container of logical relationships and does
not require any space in the database. A typical dimension hierarchy is city, state,
region, and country.
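
In Oracle this can be declared with CREATE DIMENSION. A sketch, assuming a
hypothetical denormalized customer table that carries all four levels as columns:

CREATE DIMENSION geography_dim
  LEVEL city    IS customer.city
  LEVEL state   IS customer.state
  LEVEL region  IS customer.region
  LEVEL country IS customer.country
  HIERARCHY geo_rollup (
    city CHILD OF state CHILD OF region CHILD OF country
  );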

5. Deployment

Data warehouse deployment improves the execution of a business strategy in addition
to its development. This evolution places a growing set of service-level demands upon
the data warehouse architect. We now consider the evolution of data warehousing
through the five stages that are most common in the maturation of decision support
within an organization.

5.1 Stage 1: Reporting

The initial stage of data warehouse deployment typically focuses on reporting from a
single source of truth within an organization. The data warehouse brings huge value
simply by integrating disparate sources of information within an organization into a
single repository to drive decision-making across functional and/or product boundaries.

For the most part, the questions in a reporting environment are known in advance.
Thus, database structures can be optimized to deliver good performance even when
queries require access to huge amounts of information.

The biggest challenge in Stage 1 data warehouse deployment is data integration. The
challenges in constructing a repository with consistent, cleansed data cannot be
overstated. There can easily be hundreds of data sources in a legacy computing
environment, each with a unique domain value standard and underlying implementation
technology. The hard work that goes into providing well-integrated information for
decision-makers becomes the foundation for all subsequent stages of data warehouse
deployment.

5.2 Stage 2: Analysis

In a Stage 2 data warehouse deployment, decision-makers focus less on what happened
and more on why it happened. Analysis activities are concerned with drilling down
beneath the numbers on a report to slice and dice data at a detailed level.

Ad hoc analysis plays a big role in Stage 2 data warehouse implementations. Questions
against the database cannot be known in advance. Performance management relies a
lot more on advanced optimizer capability in the RDBMS because query structures are
not as predictable as they are in a pure reporting environment.

Performance is also a lot more important in a Stage 2 data warehouse implementation
because the information repository is used much more interactively. Whereas reports
are typically scheduled to run on a regular basis with business calendars as a driver for
timing, ad hoc analysis is fundamentally a hands-on activity with iterative refinement of
questions in an interactive environment. Business users require direct access to the data
warehouse via GUI tools without the need for programmer intermediaries. Support for
concurrent query execution and large numbers of users against the warehouse is typical
of a Stage 2 implementation.

Business users, however, are a very impatient bunch. Performance must provide
response times measured in seconds or a small number of minutes for drill-downs in an
OLAP (online analytical processing) environment. The database optimizer's ability to
determine efficient access paths, using indexing and sophisticated join techniques, plays
a critical role in allowing flexible access to information within acceptable response times.

5.3 Stage 3: Prediction

As an organization becomes well-entrenched in quantitative decision-making techniques
and experiences the value proposition for understanding the "whats" and "whys" of its
business dynamics, the next step is to leverage information for predictive purposes.

Understanding what will happen next in the business has huge implications for
proactively managing the strategy for an organization. Stage 3 data warehousing
requires data mining tools for building predictive models using historical detail.

The number of end users who will apply the advanced analytics involved in predictive
modeling is relatively small. However, the workloads associated with model construction
and scoring are intense. Model construction typically involves derivation of hundreds of
complex metrics for hundreds of thousands (or more) of observations as the basis for
training the predictive algorithms for a specific set of business objectives. Scoring is
frequently applied against a larger set (millions) of observations because the full
population is scored rather than the smaller training sets used in model construction.
Advanced data mining methods often employ complex mathematical functions such as
logarithms, exponentiation, trigonometric functions and sophisticated statistical functions
to obtain the predictive characteristics desired. Access to detailed data is essential to the
predictive power of the algorithms. Tools from vendors such as SAS and Quadstone
provide a framework for development of complex models and require direct access to
information stored in the relational structures of the data warehouse.

The business end users in the data mining space tend to be a relatively small group of
very sophisticated analysts with market research or statistical backgrounds. However,
beware in your capacity planning! This small quantity of end users can easily consume
50% or more of the machine cycles on the data warehouse platform during peak
periods. This heavy resource utilization is due to the complexity of data access and
volume of data handled in a typical data-mining environment.

5.4 Stage 4: Operationalize

Operationalization in Stage 4 of the evolution starts to bring us into the domain of active
data warehousing. Whereas stages 1 to 3 focus on strategic decision-making within an
organization, Stage 4 focuses on tactical decision support. Think of strategic decision
support as providing the information necessary to make long-term decisions in the
business. Applications of strategic decision support include market segmentation,
product (category) management strategies, profitability analysis, forecasting and many
others.

Tactical decision support is not focused on developing corporate strategy in the ivory
tower, but rather on supporting the people in the field who execute it.

Operationalization typically means providing access to information for immediate
decision-making in the field. Two examples are (1) inventory management with just-in-
time replenishment and (2) scheduling and routing for package delivery. Many retailers
are moving toward vendor managed inventory, with a retail chain and the
manufacturers that supply it working as partners. The goal is to reduce inventory costs
through more efficient supply chain management. In order for the partnership to be
successful, access to information regarding sales, promotions, inventory-on-hand, etc.
must be provided to the vendor at a detailed level. Manufacturing, delivery and so on
can then be executed efficiently based on inventory requirements on a per-store basis. To be
useful, the information must be extremely up-to-date and query response times must be
very fast.

In the example of package shipping with less-than-full-load trucking, there are very
complex decisions involved in how to schedule trucks and route packages. Trucks
generally converge at break bulks wherein packages get moved from one truck to
another so that they ultimately arrive at their desired destination (in a way very
analogous to how humans are shuffled around between connecting flights at an airline
hub). When packages are on a late-arriving truck, tough decisions need to get made in
regard to whether the connecting truck that the late package is scheduled for will wait
for the package or leave on time. If it leaves without the package, the service level on
that package may be compromised. On the other hand, waiting for the delayed package
may cause other packages that are ready to go to miss their service levels.
How long the truck should wait will depend on the service levels of all delayed packages
destined for the truck as well as service levels for those packages already on the truck.
A package due the next day is obviously going to have more difficulty in meeting its
service levels under conditions of delay than one that is not due until many days later.

Moreover, the sending and receiving parties associated with the package shipment
should also be considered. Higher priority on making service levels should be given to
packages associated with profitable customers where the relationship may be at risk if a
package is late. Alternative routing options for the late packages, weather conditions
and many other factors may also come into play. Making good decisions in this
environment amounts to a highly complex optimization problem.


It is clear that a break bulk manager will dramatically increase the quality of his or her
scheduling and routing decisions with the assistance of advanced decision support
capabilities. However, for these capabilities to be useful, the information to drive
decision-making must be extremely up-to-date. This means continuous data acquisition
into the data warehouse in order for the decision-making capabilities to be relevant to
day-to-day operations. Whereas a strategic decision support environment can use data
that is loaded once per month or once per week, this lack of data freshness is
unacceptable for tactical decision support. Furthermore, the response time for queries
must be measured in a small number of seconds in order to accommodate the realities
of decision-making in an operational, field environment.

5.5 Stage 5: Activate

The larger the role an active data warehouse plays in the operational aspects of decision
support, the more incentive the business has to automate the decision processes. Both
for efficiency reasons and for consistency in decision-making, the business will want to
automate decisions when humans do not add significant value.

In e-commerce business models there is no choice but to automate decision-making
when a customer interacts with a Web site. Interactive customer relationship
management (CRM) on a Web site or at an ATM is all about making decisions to
optimize the customer relationship through individualized product offers, pricing, content
delivery and so on. The very complex decision-making associated with interactive CRM
takes place without humans in a completely automated fashion and must be executed
with response times measured in seconds or milliseconds.

As technology evolves, more and more decisions become executed with event-driven
triggers to initiate fully automated decision processes. For example, the retail industry is
on the verge of a technology breakthrough in the form of electronic shelf labels. This
technology obsoletes the old-style Mylar labels, which require manual labor to update
prices by swapping small plastic numbers on a shelf label. The new electronic labels can
implement price changes remotely via computer controls without any manual labor.

Integration of the electronic shelf label technology with an active data warehouse
facilitates sophisticated price management with as much automation as a business cares
to deploy. For seasonal items in stores where inventories are higher than they ought to
be, it will be possible to automatically initiate sophisticated mark-down strategies to
drive maximum sell-through with minimum margin erosion. Whereas a sophisticated
mark-down strategy is prohibitively costly in the world of manual pricing, the use of
electronic shelf labels with promotional messaging and dynamic pricing opens a whole
new world of possibilities for price management. Moreover, the power of an active data
warehouse allows these decisions to be made in an optimal fashion on an item-by-item,
store-by-store and second-by-second basis using event triggering and sophisticated
decision support capability. In a CRM context, even customer-by-customer decisions are
possible with an active data warehouse.

Intense competition and technology innovations are motivating these advances in
decision support deployment. An active data warehouse delivers information and
enables decision support throughout an organization rather than being confined to
strategic decision-making processes. However, tactical decision support does not replace
strategic decision support. Rather, an active data warehouse supports the coexistence of
both types of workloads. Notice in Figure 2 that a significant amount of workload in a
Stage 5 data warehouse is still focused on strategic thinking. The operationalized and
event triggered decision support of stages 4 and 5 provide the execution capability for
strategies developed from traditional data warehouse analysis characterized in stages 1
to 3.

6. Managing Data Warehouses

Data warehouses are not magic; they take a great deal of very hard work. In many
cases data warehouse projects are viewed as a stopgap measure to get users off our
backs or to provide something for nothing. But data warehouses require careful
management and marketing.

A data warehouse is a good investment only if end-users actually can get at vital
information faster and cheaper than they can with current technology. As a consequence,
management has to think seriously about how they want their warehouses to perform
and how they are going to get the word out to the end-user community. And
management has to recognize that the maintenance of the data warehouse structure is
as critical as the maintenance of any other mission-critical application. In fact,
experience has shown that data warehouses quickly become one of the most used
systems in any organization.

Management, especially IT management, must also understand that if they embark on a
data warehousing program, they are going to create new demands upon their
operational systems: demands for better data, demands for consistent data, demands
for different kinds of data.

Now we will discuss the different tasks involved in maintaining a data warehouse so that
it continues to satisfy users' needs and reflect changes to the database. For a database that does not
change except when the entire database is loaded with new data, maintenance tasks
are minimal. You need only ensure that enough space is available in the default and
named segments to accommodate the current batch of input data. Any needed
restoration of the database can be done from the original input data files.

For databases that are modified by incremental load operations or INSERT, UPDATE, or
DELETE statements, maintenance includes accommodating growth in the database, as
well as adjusting to changes in users' needs or the warehouse environment. For
databases that change, backing up the data regularly is an important part of
maintenance.

The data warehouse can support the following maintenance tasks (a sample monitoring query follows the list):

o Locking Tables and Databases
o Obtaining Information on Tables and Indexes
o Monitoring Growth of Tables and Indexes
o How Space is Allocated to a Segment
o Altering Segments
o Maintaining a Time-Cyclic Database
o Recovering a Damaged Segment
o Managing Optical Storage
o Altering Tables
o Copying or Moving a Database
o Modifying the Configuration File
o Monitoring and Controlling a Database Server
o Determining Version Information
o Deleting Database Objects and Databases
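
For instance, monitoring the growth of tables and indexes can be as simple as querying
the data dictionary (an Oracle-style sketch using the user_segments view):

SELECT   segment_name,
         segment_type,
         ROUND(bytes / 1024 / 1024) AS size_mb
FROM     user_segments
WHERE    segment_type IN ('TABLE', 'INDEX', 'TABLE PARTITION', 'INDEX PARTITION')
ORDER BY bytes DESC;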
