Professional Documents
Culture Documents
Version 2.0
December 2007
This document, and the information herein, are the exclusive property of Teradata Corporation; all unauthorized
use and reproduction are prohibited. Copyright (C) 2008 by Teradata Corporation, Dayton, Ohio, USA.
All rights reserved. Printed in Denmark. Confidential unpublished property of Teradata Corporation
Document Changes
Rev. Date Section Comment
1.0 Jan 2003 All Initial Issue – deployed with TSM 4.1
2.0 Dec 2007 All Teradata Branding
Trademarks
All trademarks and service marks mentioned in this document are marks of their respective
owners and are as such acknowledged by Teradata Corporation.
Control Information
Page 58 is the last page of this document. This document is under Revision Control.
Version 1.0
1. Introduction................................................................................................................1
1.1 Design considerations..........................................................................................1
1.2 Scope...................................................................................1
Appendix B. Sizing........................................................................................16
B.1 Erwin volumetric...............................................................................................16
1.2 Scope
Because we harvested best practices from several different sources, it proved difficult to
produce a document with a natural flow that reads easily from cover to cover. Instead we
have created a very small main document and moved the harvested material into separate
appendices. We believe a PDD document is used for reference rather than cover-to-cover
reading, so the small main body serves as a reading guide. We have established a list of
the main TSM tasks related to PDD and, for each of those, indicated which appendices
cover the subjects of the task. The small main body thus provides a frame of reference for
the appendices.
Design Considerations

Enterprise Data Warehouse (Flexibility, Ease of Maintenance):
• 3rd Normal Form
• “Ad hoc” Query Centric
• Data Transformation
• Reporting
• Logical Data Model (LDM) Lead

Application Solution (Performance, Optimization):
• More Denormalized
• Star Schema
• Snowflake Schema
• “Canned” Query Centric
• Calculation / Process Centric
• Physical Data Model (PDM) Lead
[Figure: manual data and ETL feeds flow into the EDW. Replication and propagation
(R&P, the process of replicating and propagating to a dependent data mart) carries data
onward to a data mart that serves ad hoc queries, standard reports, and OLAP reports.]
Task: Determine the design of databases and users (Teradata database objects) and their
hierarchies. Determine the databases used for tables, views, and macros. Determine how
users' access rights to views and tables will be managed.
Reference: Appendix A. Design of the Physical Database

Task: Design the DBA functions for changing database objects and users, granting access
rights, monitoring performance, monitoring table space, monitoring spool space, and
monitoring security. Determine how privacy requirements will be met.
Reference: Appendix A. Design of the Physical Database

Task: Determine the physical tables that will be implemented so that performance and
availability requirements are met for applications and ECTL. Document deviations from
the LDM and the reasons for a different physical design.
Reference: Appendix B. PDM Fundamentals and Appendix D. CRM Denormalization
Guidelines

Task: Determine the estimated data volume per table.
Reference: Appendix B. Sizing
Database Usage

Database         Usage
EDWadmin         The top super user for administration of the EDW hierarchy
PP_Admins        Administrative users responsible for administering the database objects and users
PA_Admins        Administrative users responsible for the administration of the Teradata database
PP_BackupUsers   Users for running the BAR processes
PP_LoadUsers     Users responsible for staging, transforming, and loading data
/********************************************************
* Create EDW databases, run as EDWadmin *
********************************************************/
create database DP_TAB from D_Production as perm = 0;
create database DP_UTL from D_Production as perm = 0;
create database DP_WRK from D_Production as perm = 0;
create database DP_VEW from D_Production as perm = 0;
create database DP_LOG from D_Production as perm = 0;
create database DP_MDB from D_Production as perm = 0;
grant insert, select, update, delete, execute on DP_MDB to all PP_RepUsers; /* Index might be required */

/***********************************************************
* Create Access rights for Development, run as UD_Admin   *
***********************************************************/
grant insert, select, update, delete, execute on DD_MDB to all PD_RepUsers; /* Index might be required */
grant execute, select on DD_TAB to all PD_RepUsers;
grant insert, select, execute on DD_LOG to all PD_RepUsers;

/***********************************************************
* Create Access rights for Test, run as UT_Admin          *
***********************************************************/
grant insert, select, update, delete, execute on DT_MDB to all PT_RepUsers; /* Index might be required */
grant execute, select on DT_TAB to all PT_RepUsers;
grant insert, select, execute on DT_LOG to all PT_RepUsers;
A.4 Recommendations
The DBC and EDWadmin passwords have to be protected to prevent any unauthorized
access to the database. These passwords need to be secured in a sealed envelope, with a
defined mechanism for accessing them only in a crisis.
It is recommended that EDWadmin and UP_admin revoke their own Drop Table
privileges on the Dx_TAB databases to avoid accidental drops of objects. When an object
must be dropped, they can grant the Drop Table privilege to themselves, complete the
drop action cautiously, and revoke the privilege again.
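As a sketch, the revoke/re-grant cycle might look like the following (the database and
table names follow the earlier examples; the dropped table name is hypothetical):

```sql
/* Run as EDWadmin: give up Drop Table on the production table
   database to guard against accidental drops. */
revoke drop table on DP_TAB from EDWadmin;

/* When a drop is genuinely required, temporarily restore the
   privilege, perform the drop, and revoke again. */
grant drop table on DP_TAB to EDWadmin;
drop table DP_TAB.obsolete_table;   /* hypothetical table name */
revoke drop table on DP_TAB from EDWadmin;
```

Note that this protects only against accidents, not malice, since the user can re-grant the
privilege to themselves at any time.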
You can use ERwin volumetrics to accurately calculate the size of tables, indexes, and
physical storage objects in your database. When you calculate database size and growth,
you can:
Version 1.0 Teradata - CONFIDENTIAL AND PROPRIETARY
Use Pursuant to Company Instructions
Appendix C. Physical Database Design
This section provides tips on how to create an optimized database structure to get a high
performance DW. Note however, the final optimizations required cannot be done before a
more detailed knowledge of the data involved is available. The optimizations require data
demographics information as well as knowledge of the usage pattern.
Techniques for optimizing an application fall into two categories. One deals with the
physical data model (PDM); the other deals with queries and the use of SQL features
such as temporary tables, derived tables, the CASE feature, etc. The second category is
not described in this document.
Collected statistics aid the Optimizer in choosing the best join plan. Without collected
statistics, the Optimizer could make an assumption that does not apply and could cause
the Optimizer to choose a poor join plan.
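As a minimal illustration, assuming the transaction table and column names used later in
this document, statistics would be collected and inspected like this:

```sql
/* Collect statistics on the join and filter columns the Optimizer
   relies on; refresh them after significant data volume changes. */
collect statistics on tx column account_id;
collect statistics on tx column tx_dt;
collect statistics on tx index (tx_dt, product_id, location_id);

/* Inspect what has been collected so far. */
help statistics tx;
```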
The purpose of this white paper is to provide a framework for evaluating tradeoffs in
physical database design for deployment of the Teradata Customer Relationship
Management (CRM) Application Suite. In this white paper we examine a variety of
options for primary indexing and table organization corresponding to various real-world
scenarios that have been implemented at client sites. Benchmarking has been undertaken
to quantify performance tradeoffs, along with spool space requirements, for the various
options. This white paper is the result of benchmarking and analysis efforts undertaken
by Carlos Bouloy, Stephen Brobst, Tim Grant, and Gonzalo Hidalgo. The content of this
document represents a synthesis of experience to date with the Teradata CRM Application
Suite. These are guidelines for implementation and must be tempered with the specific
circumstances of each client deployment. We have tried to capture the most commonly
discussed alternatives for physical design. If you have another scenario that should be
included, please let us know.
The primary index for a table defines the column(s) by which the table will be assigned to
virtual AMPs (vAMPs) within the parallel database. The general rule of thumb for
primary index selection is to select column(s) that have the following two properties:
• Provide a frequently used access path for joins and aggregates, to maximize the
occurrence of localized operations in the parallel database
• Provide a large set of unique values with relatively even distribution of rows,
so as to balance the parallel workload across all available vAMPs
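The distribution property of a candidate primary index can be checked before committing
to it. The sketch below, using Teradata's hashing functions (table and column names are
illustrative), counts rows per vAMP; a large spread between the minimum and maximum
counts signals skew:

```sql
/* Rows per vAMP for a candidate primary index column. */
select hashamp(hashbucket(hashrow(account_id))) as vamp_no
      ,count(*) as row_count
from   tx
group by 1
order by 2 desc;
```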
In this section, we will discuss the specific tradeoffs related to primary index selection for
the transaction table in a Teradata CRM implementation. The transaction table represents
detailed customer behaviors such as purchases in a retail outlet, call detail records (CDRs)
for a telecommunications provider, deposits and withdrawals for a financial services
company, and so on. Multiple kinds of transactions may be relevant in some customer
scenarios (inquiry versus purchase, payment versus claim, etc.). In this discussion, we
will consider the tradeoffs involved in six different primary indexing options:
• Primary index on lowest level of customer hierarchy
• Composite primary index on date, product, and location
• Primary index on transaction_id
• Composite primary index on date, product, and location with use of a join
index to provide co-location with the lowest level of the customer hierarchy
• Cross-reference table implementation
• Primary index on both lowest level of customer hierarchy and date, product,
and location using table duplication
These options will be described in the following subsections along with results from
benchmark testing, where applicable. The last subsection treats special considerations for
handling anonymous transactions.
As discussed previously, all Teradata CRM implementations will use one or more
customer hierarchies. A typical customer hierarchy would include household, party
(individual), and account. In such an example, account would be considered the lowest
level in the customer hierarchy. There may be more or fewer levels in the hierarchy,
depending on the specific implementation. Note that there can also be many-to-many
relationships among levels in the hierarchy in the data model. For example, an individual
can own multiple accounts and an account can have multiple (joint) owners. Teradata
CRM provides the ability for the business to specify (both static and dynamic) rules to
disambiguate many-to-many relationships. See the “Many-to-Many and Composite Key
Considerations” paper on the Teradata CRM Services Portal. Selecting the lowest level
of the customer hierarchy (e.g., account) will generally provide the highest performance
implementation scenario for Teradata CRM implementations. This choice of primary
index allows vAMP local operations and avoids the need for spool space allocation
related to large data distributions in most cases. There are, however, two important issues
that need to be considered when assessing this performance advantage:
If a client has an existing data warehouse implementation that pre-dates its Teradata CRM
deployment, then it is likely that the transaction table already exists with a primary index
other than the lowest level of the customer hierarchy. One of the most common scenarios
(especially in retail) is to construct a composite primary index on date, product, and
location. A composite primary index with this construction is quite amenable to star join
types of queries whereby a cartesian product (star) join is executed using the qualifying
rows from the calendar, product, and location dimensions. The (relatively small) spool
file resulting from the cartesian product (star) join is duplicated to all vAMPs and a
row-hash merge join is executed extremely efficiently with the large transaction table.
The scenario described above is optimized for product-oriented analyses (such as would
be found in a category management application). Unfortunately, however, this choice of
primary index is not quite so optimized for the customer-oriented analyses as would
typically be found in the use of Teradata CRM. When joining between the lowest level of
the customer hierarchy (e.g., account) and the transaction table, there will not be
co-location between the two tables. This means that one of two forms of data
re-distribution will take place:
• A hash join via scanning of the transaction table with probing into a hash
table constructed by scanning the qualifying accounts on each vAMP (assumes
hash join is enabled)
• A local spooling of all qualifying transactions with a sort on the row hash of
account_id, followed by a row-hash merge join with the qualifying account rows
duplicated to each vAMP
Note that the presence of a NUSI (non-unique secondary index) on account_id in the
transaction table will improve performance only in cases where the percent of qualifying
accounts is very small (generally much less than one percent) or when the join can be
satisfied completely from the NUSI (as a covered index for the transaction table).
Alternatively, if the qualifying rows from transaction table are hash re-distributed
according to account_id, then a row-hash merge join will subsequently be initiated.
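For reference, such a NUSI would be declared as follows (a sketch; the index name is
illustrative, and whether the Optimizer actually uses it depends on selectivity and
coverage, as noted above):

```sql
/* Non-unique secondary index on the customer key of the
   transaction table; useful only for highly selective access
   or when the index covers the query. */
create index idx_tx_account (account_id) on tx;
```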
It is important to consider that any time a re-distribution of data is performed, spool space
requirements for the system will be increased. For an all vAMP duplication of qualifying
accounts, the total spool space requirement will be equal to the size of the projection
(usually no more than a few tens of bytes) times the number of qualifying accounts
times the number of vAMPs in the configuration. Thus, for a 20 byte projection from the
account table, 10 million qualifying accounts, and a 64 vAMP configuration, the total
spool space requirement would be approximately 12.8GB per concurrent query. Plus, the
qualifying transactions would be spooled locally to facilitate sorting and a row-hash
merge join unless a hash join is used. If the transactions are hash re-distributed, then the
additional spool space required is the size of the projection from the transaction table
times the number of qualifying transactions. For a 50 byte projection from the transaction
table and 100 million qualifying transactions, the total spool space requirement would be
approximately 5GB per concurrent query.
Although performance will vary according to the specifics of each query executed,
benchmark testing with Teradata CRM on large client configurations indicates a
create table tx
(tx_id decimal(15,0) NOT NULL
,account_id decimal(12,0) NOT NULL
,tx_dt date FORMAT 'YYYY-MM-DD' NOT NULL
,product_id integer NOT NULL
,location_id integer NOT NULL
,tx_amt decimal(8,2) NOT NULL
,tx_type char(2) NOT NULL
,cost_amt decimal(8,2) NOT NULL
,item_qty integer NOT NULL
...
) primary index (tx_dt, product_id, location_id);
In this way, a "copy" of the most frequently used columns from the transaction table is
placed into an index with a primary index on account_id. Moreover, the index has been
ordered on the transaction date, so date range elimination can be used for queries where
that is appropriate. Use of the join index as described will not only eliminate the 20-30%
performance penalty related to a non-account primary index on the base table, but will
also yield a significant performance boost above and beyond co-located joins via date
range elimination when a query specifies a date range corresponding to a subset of the
transaction table.
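A sketch of such a join index, assuming the tx table defined above (the index name and
the selection of columns are illustrative, and the syntax follows the V2R4-era releases
discussed here):

```sql
/* Single-table join index re-homing the most frequently used
   transaction columns on account_id, value-ordered by date so
   that date range elimination can apply. */
create join index jix_tx_account as
select account_id
      ,tx_dt
      ,product_id
      ,tx_amt
      ,item_qty
from   tx
order by tx_dt
primary index (account_id);
```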
In V2R4, join indexes will not be used except as covered indexes when considering
access to the base table. In other words, all columns requested from the transaction table
must be included in the join index for its use to be considered. This limitation will be
removed in V2R4.1 in which plans whereby the join index is joined back to the base table
will be considered. The current maximum of 16 columns in the join index will be
removed in V2R5 (available in 2002).
When pursuing the join index strategy, it is critical to consider the space and maintenance
overhead associated with the approach. A join index involves redundant storage of
whatever columns from the transaction table are specified in its definition. Of course, this
additional storage cost is somewhat offset by the reduction in spool space. Clients using
Teradata V2R4.1 or above should only store those transaction table columns that allow
coverage for the majority of queries rather than re-storing the full transaction table.
Maintenance for data loading is usually a more significant concern. The join index
construct will guarantee complete consistency between the index and base table(s).
However, keeping the index up-to-date comes with an associated cost. Each time a
record is inserted into the transaction table, its join index will need to be updated as well.
The cost of updating the join index in this scenario will be slightly higher than the row
insert itself. This cost must be included in the capacity planning for the data warehouse.
Join indexing is only an option for customers who are using SQL via Tpump or from
the transaction table to be upwards of 50% of the total data warehouse size. Thus,
duplicating the transaction table usually increases total storage utilization by more than
fifty percent on the client configuration. Moreover, load and indexing time for the
transaction table will be doubled from what it was prior to duplication.
An increase in total storage use by more than 50% to obtain a 20-30% performance
increase is not a particularly good design tradeoff. In fairness, the increase in storage use
is somewhat offset by a reduction in spool space requirements related to data
re-distribution. However, it is rare that spool space requirements for data re-distribution
(even in a heavy concurrent user environment) will even come close to approaching the
storage required to duplicate the largest table in the data warehouse.
In cases where pre-existing applications (or other considerations) demand that the
transaction table be primary indexed on something other than account_id, and the
Teradata CRM application is falling short of its performance service level agreement,
then the recommendation is to add overall system capacity rather than duplicate data.
Due to the scalable nature of the Teradata system, addition of 30% more capacity (CPU
and storage to accommodate non-local join overhead and spool space requirements) will
eliminate the impact of a primary index on something other than account_id. Moreover,
addition of 30% more capacity without data duplication (also avoiding associated loading
and indexing penalty) will provide additional computing resources that will benefit all
workloads. This approach maximizes the value of the client investment in Teradata and
simplifies the overall implementation. Additional capacity can be added in this way to
meet whatever service level has been agreed upon at the client site.
The idea behind the split transaction table approach is to create two versions of the
transaction table: one for anonymous transactions and one for identified transactions. All
anonymous transactions are placed in a table that does not include a customer (account)
identifier since the transactions are unidentified. The primary index on the anonymous
transaction table is most likely a composite of date, product, and location. The identified
transactions are all placed in a transaction table with account_id as a primary index. A
view with an embedded UNION between the two tables is used to provide a complete
picture of all transactions for product-oriented analyses. Note that no data duplication is
initiated when using this technique.
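Sketched in DDL (table and view names are illustrative, and column lists are abbreviated
to the keys discussed here), the split might look like:

```sql
/* Identified transactions: co-located with the customer hierarchy. */
create table tx_identified
(tx_id       decimal(15,0) NOT NULL
,account_id  decimal(12,0) NOT NULL
,tx_dt       date NOT NULL
,product_id  integer NOT NULL
,location_id integer NOT NULL
) primary index (account_id);

/* Anonymous transactions: no customer key to index on. */
create table tx_anonymous
(tx_id       decimal(15,0) NOT NULL
,tx_dt       date NOT NULL
,product_id  integer NOT NULL
,location_id integer NOT NULL
) primary index (tx_dt, product_id, location_id);

/* Unified view for product-oriented analyses. */
replace view v_tx_all as
select tx_id, tx_dt, product_id, location_id from tx_identified
union all
select tx_id, tx_dt, product_id, location_id from tx_anonymous;
```

Since the two tables are disjoint by construction, UNION ALL avoids the duplicate-
elimination cost of a plain UNION.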
Customer-oriented analyses will be optimized in performance using the split transaction
table technique because queries will only need to touch the identified transactions (the
ones of interest) and these transactions will have account_id as their primary index.
Performance cannot get any better than this for customer-oriented analyses.
Product-oriented analyses, on the other hand, will need to materialize the view that brings
the two transaction tables together into a single spool for analysis. Moreover, the
geography of the two transaction tables will be different (account_id versus date, product,
and location as the primary index) so the identified transactions will most likely be hash
re-distributed as part of the view materialization. All of this adds up to a fairly heavy
penalty for product-oriented analyses.
The performance advantage is approximately 30% for customer-oriented analyses using
the split transaction table approach versus the unified transaction table with account_id as
a primary index and constructed account_id keys for anonymous transactions (as
described in the next subsection) with a table containing 67% anonymous transactions.
The performance advantage is approximately 17% for customer-oriented analyses using
the split transaction table approach versus the unified transaction table with account_id as
a primary index and constructed account_id keys for anonymous transactions (as
described in the next subsection) with a table containing 33% anonymous transactions.
The performance advantage is a factor between 43% and 71% for customer-oriented
analyses using the split transaction table approach versus a single table with all
transactions primary indexed by date, product, and location. The performance advantage
factor depends on the percent of anonymous transactions ranging from 33% to 67%,
where best performance is with a large percentage of anonymous transactions due to
reduction in the size of the identified transaction table.
In contrast, the performance penalty is a factor of approximately 3.1 for product-oriented
analyses using the split transaction table approach (with UNION in a view) versus the
unified transaction table with account_id as a primary index and constructed account_id
keys for anonymous transactions (as described in the next subsection). The performance
penalty for product-oriented analyses using the split transaction table approach (with
UNION in a view) versus a single table with all transactions primary indexed by date,
product, and location is a factor of 100+ when the date, product, location combination is
relatively selective (allowing for a cartesian product join of the dimension tables followed
by a row-hash merge join with no local spooling of the transaction table) and a factor of
approximately 2.6 when the date, product, location combination is less selective (e.g.,
optimizer chooses not to use a cartesian product join).
One method that removes the primary indexing concerns related to anonymous
transactions is to avoid using account_id as the primary index. By sticking with date,
product, and location (or other key) as the primary index, the data distribution issues
associated with anonymous transactions are made (nearly) irrelevant. As long as we filter
out anonymous transactions in customer-oriented analysis through use of WHERE clause
predicates (to remove NULL, blank, or dummy account_id transactions), then we are
clear and free of the issues of concern. This approach is equivalent to those described in
sections 1.2 or 1.3. Of course, the downside of the approach is that customer-oriented
analyses suffer a penalty of approximately 20-30% in performance as well as spool space
overhead associated with data re-distribution. This approach may well be the most
desired one in environments where product analysis is heavily used relative to customer
analysis.
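For example, a customer-analysis view could exclude the anonymous rows up front (a
sketch; the view name and the dummy key value are assumptions):

```sql
/* Customer-oriented analyses read through this view so that
   NULL or dummy account_ids never reach the join. */
replace view v_tx_customer as
select *
from   tx
where  account_id is not null
and    account_id <> 0;   /* 0 assumed as the dummy key */
```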
associated with an identifiable customer will be duplicated into a separate table that has
account_id as its primary index. This implementation delivers performance for customer-
oriented analyses equivalent to the split transaction table approach (described in section
1.7.1) and performance for product-oriented analyses equivalent to avoiding account_id
as the primary index (described in section 1.7.3). This is the best of both performance
worlds - at first glance.
In reality, this approach is basically the same as the table duplication as described in
section 1.6. Storage costs for the identified transactions will double, as will data loading
and indexing for these transactions. For customers with a small percentage of identified
transactions, this approach may seem very appealing because the amount of data
duplication is small relative to the performance advantages. However, over time it is
likely in almost any industry that the trend will be toward more and more transactions that
are identified (clearly seen in the trends toward loyalty cards, smart cards, m-commerce,
and so on). Thus, while this approach may seem attractive in the short-term for clients
who have many more anonymous transactions than those that are identified, the
medium- to long-term result will be undesirable data duplication at large scale.
For client sites where Teradata CRM is the dominant data warehouse application, primary
indexing on the lowest level of the customer hierarchy (e.g., account_id) is generally
recommended. Exceptions to this recommendation are when there exist individual
accounts associated to tens of thousands of transactions (uneven distribution of
transactions to accounts) or when the lowest level of the customer hierarchy is highly
volatile (e.g., party_id).
For client sites where product-oriented applications (or other considerations) prevent the
transaction table from being primary indexed on account_id, use of a (non-unique)
composite primary index containing the most frequently (in combination) accessed
dimensional keys (e.g., tx_dt, product_id, location_id) is recommended to facilitate an
efficient star join. This will result in a 20-30% performance penalty and additional spool
space for customer-oriented queries (versus account_id as a primary index).
If there are performance delivery issues at the client site when using customer and
transaction tables that are not co-located, value-ordered join indexing is generally
recommended to change the geography of the most important columns from the
transaction table to allow for co-location and date range elimination with maximum
performance results. This approach will not work if the client is dependent on multiload
for loading the transaction table. Clients using join indexes should be encouraged to
upgrade to V2R4.1 as soon as possible to benefit from the significant join indexing
enhancements in this release.
Explicit data duplication via replication of the transaction table is an approach of last
resort. Since this approach involves duplication of the largest table in the data warehouse
and only yields a 20-30% benefit in performance versus a scenario where the account and
transaction tables are not co-located, it is generally preferable to add general purpose
capacity to meet whatever performance service level is required rather than use special
purpose table duplication.
For data warehouse implementations with anonymous transactions (e.g., retailers that
cannot identify customers for all transactions), the highest performance option for
customer-oriented queries will be the split transaction table approach using account_id as
the primary index for identified transactions and date, product, and location for the
primary index on anonymous transactions. However, product-oriented queries will be
penalized significantly with the split transaction table approach (a factor of 2.6 to 100+).
Product queries using the single transaction table with a primary index on account_id will
be roughly equivalent in performance to the primary index on date, product, and location
when a large cardinality of dates, products, and locations are specified. If a small
cardinality of dates, products, and locations is specified, then a primary index on date,
product, and location will perform approximately 36 times faster than when the
transaction table is primary indexed on account_id.
The table below shows the difference in performance for the two most likely primary
index choices for the transaction table in an environment where all (or nearly all)
transactions are identifiable to a customer (or account). The chart is meant to describe the
relative performance difference for “average” queries of four different types:
The table below shows the difference in performance for the three most likely primary
index choices for the transaction table in an environment where there is a mix between
identified and anonymous transactions. The chart is meant to describe the relative
performance difference for “average” queries in five distinct situations:
E.3 Many-to-many relationships in CRM 4.0
Duncan Ross
2nd July 2001
A common feature in client databases is the existence of many-to-many relationships
between tables. This is particularly common in the Financial Services Industry, where
customers can have many products (accounts), and each account may have many
customers.
Although direct many-to-many relationships occur in logical database models, they are
usually expressed through intermediate tables in physical database models.
Wherever such a relationship between tables occurs, users will have difficulties
interpreting and accessing their data.
Tools such as Teradata CRM 4.0 will also have this problem - and so a mechanism for
dealing with the problem needs to be found.
Version 1.0 Teradata - CONFIDENTIAL AND PROPRIETARY
Use Pursuant to Company Instructions
E.3.1 An example of a many-to-many relationship
The example on the prior page is taken from a life insurance company, but could be found
in many financial services organisations.
Customer is related to Policy in a many-to-many relationship, expressed through an
intermediate table CUST_POL_REL. To get meaningful information about our customer
we need to know which of their relationships with a policy we are using. Are they the
primary holder? Are they the life insured? Are they a secondary or tertiary life insured?
In CRM this can become even more complex as we may want to know answers to
questions such as:
Who are our policy holders who own a policy that is within six months of maturity where
the policy holder is not a life insured on that policy?
This view, and the other denormalised views, can now be accessed when necessary, and
have a simple one-to-many relationship to the customer table.
CUSTOMER: CUST_NO (PK), ORIG_SYS (PK), DATE_OF_BIRTH, NUM_POLICY,
ORIG_BRANCH, NAME, FAM_NAME, ADDRESS1, ADDRESS2, ADDRESS3,
POSTCODE, …
Generated Keys
The first option is to add a generated key to the table. This may be a compound of the
existing keys, or may be a totally independent value (such as an incrementing number).
In this case the original keys remain as attributes in the table.
CUSTOMER: UNIQUE_KEY (PK), CUST_NO, ORIG_SYS, DATE_OF_BIRTH,
NUM_POLICY, ORIG_BRANCH, NAME, FAM_NAME, ADDRESS1, ADDRESS2,
ADDRESS3, POSTCODE, …
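In DDL terms, the generated-key option might be sketched as follows (column data types
are assumptions made for illustration only, and the attribute list is abbreviated):

```sql
create table CUSTOMER
(UNIQUE_KEY    integer not null  /* generated surrogate key       */
,CUST_NO       integer not null  /* original keys kept as columns */
,ORIG_SYS      char(4) not null
,DATE_OF_BIRTH date
,NUM_POLICY    integer
) unique primary index (UNIQUE_KEY);
```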
CUSTOMER: CUST_NO_ORIG_SYS (PK), DATE_OF_BIRTH, NUM_POLICY,
ORIG_BRANCH, NAME, FAM_NAME, ADDRESS1, ADDRESS2, ADDRESS3,
POSTCODE, …
Views
Where a customer has already populated their database and is unwilling to change the
design, it may be possible to implement either of the above methods using views.
However, this would affect performance.
Is it a real problem?
One final approach is to determine if it is possible to use just one of the fields used in the
key to uniquely identify a record. Databases may be designed to cope with cases that
have not yet occurred.
In the example above it is known that two source systems could produce the same
customer number. The first system allocates numbers sequentially from 1, the second
sequentially from 1 000 000. Until there are at least 1 million customers this won’t be a
problem, and we can identify the customer table by CUST_NO alone.
This only defers the requirement, however, and the issue needs careful discussion with
the client.
Appendix F. V2R5 Nuggets
A number of new features have been introduced in V2R5. Field experience with these
features is still limited, so this document gives only a few hints on what the features are
and how they can be useful in future implementations. This is not a full description of all
the V2R5 features, but a highlight of features useful in connection with physical
modeling.
F.3 Identity Columns
'Identity Column' is a new column attribute that generates a unique number for each row
as rows are added to a table. Identity columns are useful when unique values must be
generated automatically for a column; they eliminate the need to generate unique ids in
an application outside the database.
When an identity column table is being bulk-loaded for the first time, there could be an
initial performance hit as every VPROC that has rows reserves a range of numbers from
DBC.IdCol and sets up its local cache entry. Thereafter, as data skew spaces out the
numbers reservation, the contention should diminish.
There is a slight overhead in generating the numbers; a rough estimate is a few seconds
for every couple of thousand rows inserted.
Note that the identity column is not yet supported by the TD load tools (TPUMP,
Terabuilder, Multiload, Fastload); support is planned from release V2R5.1.
F.4 Value list compression
This feature allows multiple values to be compressed on a column. Up to 255 distinct
values (plus NULL) may be compressed per fixed width column.
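Conceptually the feature works like dictionary encoding: each compressed value is stored once in a per-column value list and the rows carry only a small code. The sketch below models the idea only; it is not the actual Teradata storage format, which is driven by the COMPRESS phrase in the column definition:

```python
# A column with heavily repeated values (hypothetical country codes).
values = ["DK", "DK", "US", "DK", "UK", "US", "DK"]

# Value list: each distinct value stored once (up to 255 per column).
value_list = sorted(set(values))

# Each row stores only a small code instead of the full value.
codes = [value_list.index(v) for v in values]

# Decoding a code is a simple lookup into the value list.
decoded = [value_list[c] for c in codes]
```

Because the codes are much smaller than the repeated values, scan-oriented queries read fewer bytes per row, which is the source of the performance benefit described above.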
Benefits
Performance improvement, because there is less physical data to retrieve during
scan-oriented queries.
Update, delete, and insert performance can suffer. When we repeat a data
element in two or more tables, we can usually retrieve the values within this data
element much more quickly. However, if we have to change the value in this data
element, we need to change it in every table where it resides. If Bob Jones
appears in five different tables, and Bob would prefer to be called “Robert”, we
will need to change “Bob Jones” to “Robert Jones” in all five tables, which takes
longer than making this change to just one table.
Sometimes even read performance can suffer. We denormalize to increase
read or retrieval performance. Yet if too many data elements are denormalized
into a single entity, each record length can get very large and there is the potential
that a single record might span a database block size, which is the length of
contiguous memory defined within the database. If a record is longer than a block
size, it could mean that retrieval time will take much longer because now some of 45
the information the user requests will be in one block, and the rest of the
information could be in a different part of the disk, taking significantly more time
to retrieve. A Shipment entity I’ve encountered recently suffered from this
problem.
You may end up with too much redundant data. Let's say the
CUSTOMER LAST NAME data element takes up 30 characters. Repeating this
data element three times means we are now using 90 instead of 30 characters. In a
table with a small number of records, or with duplicate data elements with a fairly
short length, this extra storage space will not be substantial. However, in tables
with millions of rows, every character could require megabytes of additional
space.
It may mask lack of understanding. The performance and storage
implications of denormalizing are very database- and technology-specific. Not
fully understanding the data elements within a design, however, is more of a
functional and business concern, with potentially much worse consequences. We
should never denormalize without first normalizing. When we normalize, we
increase our understanding of the relationships between the data elements. We
need this understanding in order to know where to denormalize. If we just go
straight to a denormalized design, we could make very poor design decisions that
could require complete system rewrites soon after going into production. I once
reviewed the design for an online phone directory, where all of the data elements
for the entire design were denormalized into a single table. On the surface, the
table looked like it was properly analyzed and contained a fairly accurate primary
key. However, I started grilling the designer with specific questions about his
online phone directory design:
“What if an employee has two home phone numbers?”
“How can we store more than one email address for the same employee?”
“Can two employees share the same work phone number?”
After receiving a blank stare from the designer, I realized that
denormalization was applied before fully normalizing, and therefore, there
was a significant lack of understanding of the relationships between the data
elements.
It might introduce data quality problems. By having the same data element
multiple times in our design, we substantially increase opportunities for data
quality issues. If we update Bob's first name from Bob to Robert in 4 out of 5 of
the places his name occurred, we have potentially created a data quality issue.
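The update anomaly above can be reproduced in a few lines. The table and column names are hypothetical, and SQLite is used purely for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The same first name stored redundantly in two tables.
con.executescript("""
CREATE TABLE orders   (order_id INTEGER, cust_first TEXT);
CREATE TABLE shipping (ship_id  INTEGER, cust_first TEXT);
INSERT INTO orders   VALUES (1, 'Bob');
INSERT INTO shipping VALUES (1, 'Bob');
""")
# The update is applied to only one of the copies...
con.execute("UPDATE orders SET cust_first = 'Robert' "
            "WHERE cust_first = 'Bob'")
# ...and the copies now disagree: a data quality issue.
names = {r[0] for r in con.execute(
    "SELECT cust_first FROM orders UNION SELECT cust_first FROM shipping")}
```

In a normalized design the name would exist in exactly one place and the inconsistency could not arise.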
Being aware of these potential dangers of denormalization encourages us to make
denormalization decisions very selectively. We need to have a full understanding of the
pros and cons of each opportunity we have to denormalize. This is where the
Denormalization Survival Guide becomes a very important tool. The Denormalization
Survival Guide will help us make the right denormalization decisions, so that our designs
can survive the test of time and minimize the chances of these bleak situations
occurring.
DENORMALIZATION GUIDELINES
Normalization is the process of putting one fact in one appropriate place. This optimizes
updates at the expense of retrievals. When a fact is stored in only one place, retrieving
many different but related facts usually requires going to many different places. This
tends to slow the retrieval process. Updating is quicker, however, because the fact you're
updating exists in only one place.
It is generally recognized that all relational database designs should be based on a
normalized logical data model. With a normalized data model, one fact is stored in one
place, related facts about a single entity are stored together, and every column of each
entity refers non-transitively to only the unique identifier for that entity. Although an in-
depth discussion of normalization is beyond the scope of this article, brief definitions of
the first three normal forms follow:
In first normal form, all entities must have a unique identifier, or key, that can
be composed of one or more attributes. In addition, all attributes must be atomic
and non-repeating. (Atomic means that the attribute must not be composed of
synchronized. At any rate, all users should be informed of the implications of inconsistent
data if it is deemed impossible to avoid unsynchronized data.
When updating any column that is replicated in many different tables, always update it
everywhere that it exists simultaneously, or as close to simultaneously as possible given
the physical constraints of your environment. If the denormalized tables are ever out of
sync with the normalized tables be sure to inform end-users that batch reports and on-line
queries may not contain sound data; if at all possible, this should be avoided.
Finally, be sure to design the application so that it can be easily converted from using
denormalized tables to using normalized tables.
- Many critical queries and reports exist which rely upon data from more than
one table. Oftentimes these requests need to be processed in an on-line
environment.
- Repeating groups exist which need to be processed in a group instead of
individually.
- Many calculations need to be applied to one or many columns before queries
can be successfully answered.
- Tables need to be accessed in different ways by different users during the
same timeframe.
- Many large primary keys exist which are clumsy to query and consume a
large amount of DASD when carried as foreign key columns in related tables.
- Certain columns are queried a large percentage of the time. Consider 60% or
greater to be a cautionary number flagging denormalization as an option.
Be aware that each new RDBMS release usually brings enhanced performance and
improved access options that may reduce the need for denormalization. However, most of
the popular RDBMS products on occasion will require denormalized data structures.
There are many different types of denormalized tables which can resolve the performance
problems caused when accessing fully normalized data. The following topics will detail
the different types and give advice on when to implement each of the denormalization
types.
Pre-Joined Tables
If two or more tables need to be joined on a regular basis by an application, but the cost of
the join is prohibitive, consider creating tables of pre-joined data. The pre-joined tables
should:
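A pre-joined table can be sketched as follows. The DEPT/EMP tables are hypothetical and SQLite syntax is used for illustration only:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dept (dept_no INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE emp  (emp_no  INTEGER PRIMARY KEY, name TEXT,
                   dept_no INTEGER REFERENCES dept);
INSERT INTO dept VALUES (10, 'Sales'), (20, 'IT');
INSERT INTO emp  VALUES (1, 'Ann', 10), (2, 'Bob', 20);
""")
# Materialise the join once; readers then scan a single table and
# never pay the join cost at query time.
con.execute("""
CREATE TABLE emp_dept_prejoin AS
SELECT e.emp_no, e.name, d.dept_no, d.dept_name
FROM   emp e JOIN dept d ON d.dept_no = e.dept_no
""")
rows = con.execute("SELECT name, dept_name FROM emp_dept_prejoin "
                   "ORDER BY emp_no").fetchall()
```

The trade-off is that the pre-joined table must be rebuilt or maintained whenever the base tables change.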
Report Tables
Oftentimes it is impossible to develop an end-user report using SQL or QMF alone.
These types of reports require special formatting or data manipulation. If certain critical
or highly visible reports of this nature are required to be viewed in an on-line
environment, consider creating a table that represents the report. This table can then be
queried using SQL, QMF, and/or another report facility. The report should be created
using the appropriate mechanism (application program, 4GL, SQL, etc.) in a batch
environment. It can then be loaded into the report table in sequence. The report table should:
Mirror Tables
If an application system is very active it may be necessary to split processing into two (or
more) distinct components. This requires the creation of duplicate, or mirror tables.
Consider an application system that has very heavy on-line traffic during the morning and
early afternoon hours. This traffic consists of both querying and updating of data.
Decision support processing is also performed on the same application tables during the
afternoon. The production work in the afternoon always seems to disrupt the decision
support processing, causing frequent timeouts and deadlocks.
This situation could be corrected by creating mirror tables. A foreground set of tables
would exist for the production traffic and a background set of tables would exist for the
decision support reporting. A mechanism to periodically migrate the foreground data to
background tables must be established to keep the application data synchronized. One
such mechanism could be a batch job executing UNLOAD and LOAD utilities. This
should be done as often as necessary to sustain the effectiveness of the decision support
processing.
It is important to note that since the access needs of decision support are often
considerably different than the access needs of the production environment, different data
definition decisions such as indexing and clustering may be chosen for the mirror tables.
Split Tables
If separate pieces of one normalized table are accessed by different and distinct groups of
users or applications then consider splitting the table into two (or more) denormalized
tables; one for each distinct processing group. The original table can also be maintained if
other applications exist that access the entire table. In this scenario the split tables should
be handled as a special case of mirror table. If an additional table is not desired then a
view joining the tables could be provided instead.
Tables can be split in one of two ways: vertically or horizontally. Refer to Figure 2. A
vertical split cuts a table column-wise, such that one group of columns is placed into one
new table and the remaining columns are placed in another new table. A horizontally split
table is a row-wise split. To split a table horizontally, rows are classified into groups via
key ranges. The rows from one key range are placed in one table, those from another key
range are placed in a different table, and so on.
Vertically split tables should be created by placing the primary key columns for the old,
normalized table into both of the split tables. Designate one of the two, new tables as the
parent table for the purposes of referential integrity unless the original table still exists. In
this case, the original table should be the parent table in all referential constraints. If this
is the case, and the split tables are read only, do not set up referential integrity (RI) for the
split tables as they are being derived from a referentially intact source. RI would be
redundant.
When a vertical split is being done, always include one row per primary key in each split
table. Do not eliminate rows from either of the two tables for any reason. If rows are
eliminated the update process and any retrieval process that must access data from both
tables will be unnecessarily complicated.
When a horizontal split is being done, try to split the rows between the new tables to
avoid duplicating any one row in each new table. This is done by splitting using the
primary key such that discrete key ranges are placed in separate split tables. Simply
stated, the operation of UNION ALL, when applied to the horizontally split tables, should
not add more rows than contained in the original, un-split tables. Likewise, it should not
contain fewer rows either.
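The UNION ALL invariant just stated can be checked directly. The key ranges and table names below are hypothetical, with SQLite standing in for the target database:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (cust_no INTEGER PRIMARY KEY, name TEXT);
INSERT INTO customer VALUES
  (1, 'Ann'), (2, 'Bob'), (500001, 'Cid'), (500002, 'Dee');
-- Horizontal split on discrete, non-overlapping key ranges.
CREATE TABLE customer_low  AS
  SELECT * FROM customer WHERE cust_no <  500000;
CREATE TABLE customer_high AS
  SELECT * FROM customer WHERE cust_no >= 500000;
""")
# UNION ALL over the splits must return exactly the original rows:
# no duplicates (ranges overlap) and no losses (ranges leave gaps).
total = con.execute("""
SELECT COUNT(*) FROM (SELECT * FROM customer_low
                      UNION ALL
                      SELECT * FROM customer_high)""").fetchone()[0]
original = con.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

A row-count comparison like this makes a useful sanity check after any horizontal split.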
Combined Tables
Redundant Data
Sometimes one or more columns from one table are accessed whenever data from another
table is accessed. If these columns are accessed frequently with tables other than those in
which they were initially defined, consider carrying them in those other tables as
redundant data. By carrying these additional columns, joins can be eliminated and the
speed of data retrieval will be enhanced. This should only be attempted if the normal
access is debilitating.
Consider, once again, the DEPT and EMP tables. If most of the employee queries require
the name of the employee's department then the department name column could be
carried as redundant data in the EMP table. The column should not be removed from the
DEPT table, though (causing additional update requirements if the department name
changes).
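The DEPT/EMP case can be sketched as follows, showing both the join that the redundancy eliminates and the extra update it creates (hypothetical names, SQLite for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dept (dept_no INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE emp  (emp_no INTEGER PRIMARY KEY, name TEXT,
                   dept_no INTEGER, dept_name TEXT);  -- redundant copy
INSERT INTO dept VALUES (10, 'Sales');
INSERT INTO emp  VALUES (1, 'Ann', 10, 'Sales');
""")
# The common query no longer needs a join...
no_join = con.execute("SELECT dept_name FROM emp "
                      "WHERE emp_no = 1").fetchone()[0]
# ...but a department rename must now touch both tables to stay consistent.
con.execute("UPDATE dept SET dept_name = 'Marketing' WHERE dept_no = 10")
con.execute("UPDATE emp  SET dept_name = 'Marketing' WHERE dept_no = 10")
renamed = con.execute("SELECT dept_name FROM emp "
                      "WHERE emp_no = 1").fetchone()[0]
```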
In all cases columns that can potentially be carried as redundant data should be
characterized by the following attributes:
important users
Repeating Groups
When repeating groups are normalized they are implemented as distinct rows instead of
distinct columns. This usually results in higher DASD usage and less efficient retrieval
because there are more rows in the table and more rows need to be read in order to satisfy
queries that access the repeating group.
Sometimes, denormalizing the data by storing it in distinct columns can achieve
significant performance gains. However, these gains come at the expense of flexibility.
For example, consider an application that is storing repeating group information in the
normalized table below:
This table can store an infinite number of balances per customer, limited only by
available storage and the storage limits of the RDBMS. If the decision were made to
string the repeating group, BALANCE, out into columns instead of rows, a limit would
need to be set for the number of balances to be carried in each row. An example of this
after denormalization is shown below:
In this example, only six balances may be stored for any one customer. The number six is
not important, but the concept that the number of values is limited is important. This
reduces the flexibility of data storage and should be avoided unless performance needs
dictate otherwise.
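The two shapes might look like this, with hypothetical balance tables and SQLite syntax used only to illustrate the contrast:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Normalised: one row per balance, any number of balances per customer.
CREATE TABLE cust_balance (cust_no INTEGER, period INTEGER, balance REAL,
                           PRIMARY KEY (cust_no, period));
INSERT INTO cust_balance VALUES (1, 1, 100.0), (1, 2, 110.0), (1, 3, 95.0);
-- Denormalised: at most six balances, strung out into columns.
CREATE TABLE cust_balances_wide (cust_no INTEGER PRIMARY KEY,
    bal1 REAL, bal2 REAL, bal3 REAL, bal4 REAL, bal5 REAL, bal6 REAL);
INSERT INTO cust_balances_wide VALUES
  (1, 100.0, 110.0, 95.0, NULL, NULL, NULL);
""")
# One single-row fetch replaces a three-row scan, at the cost of the
# hard six-balance limit described above.
wide = con.execute("SELECT bal1, bal2, bal3 FROM cust_balances_wide "
                   "WHERE cust_no = 1").fetchone()
```

A seventh balance would force a schema change in the wide table, whereas the normalised form simply gains a row.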
Before deciding to implement repeating groups as columns instead of rows be sure that
the following criteria are met:
- the data is rarely or never aggregated, averaged, or compared within the row
- the data occurs in a statistically well-behaved pattern
- the data has a stable number of occurrences
- the data is usually accessed collectively
- the data has a predictable pattern of insertion and deletion
If any of the above criteria are not met, SQL SELECT statements may be difficult to code
making the data less available due to inherently unsound data modeling practices. This
should be avoided because, in general, data is denormalized only to make it more readily
available.
Derivable Data
If the cost of deriving data using complicated formulae is prohibitive then consider
storing the derived data in a column instead of calculating it. However, when the
underlying values that comprise the calculated value change, it is imperative that the
stored derived data also be changed otherwise inconsistent information could be reported.
This will adversely impact the effectiveness and reliability of the database.
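The refresh obligation can be sketched as follows. The order tables and the derivation (an order total) are hypothetical, with SQLite for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE order_line (order_no INTEGER, qty INTEGER, price REAL);
-- order_total carries the derived sum so queries avoid re-computation.
CREATE TABLE orders (order_no INTEGER PRIMARY KEY, order_total REAL);
INSERT INTO order_line VALUES (1, 2, 5.0), (1, 1, 3.0);
INSERT INTO orders VALUES (1, 13.0);
""")
# When an underlying value changes, the stored derivation must be
# refreshed in the same unit of work, or the two will disagree.
con.execute("UPDATE order_line SET qty = 3 "
            "WHERE order_no = 1 AND price = 5.0")
con.execute("""UPDATE orders SET order_total =
    (SELECT SUM(qty * price) FROM order_line WHERE order_no = 1)
    WHERE order_no = 1""")
total = con.execute("SELECT order_total FROM orders "
                    "WHERE order_no = 1").fetchone()[0]
```

Skipping the second UPDATE would leave the stored total at 13.0 while the lines sum to 18.0, exactly the inconsistent reporting the text warns about.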
Sometimes it is not possible to immediately update derived data elements when the
columns upon which they rely change. This can occur when the tables containing the
Hierarchies
A hierarchy is a structure that is easy to support using a relational database such as DB2,
but is difficult to retrieve information from efficiently. For this reason, applications which
rely upon hierarchies very often contain denormalized tables to speed data retrieval. Two
examples of these types of systems are the classic Bill of Materials application and a
Departmental Reporting system. A Bill of Materials application typically records
information about parts assemblies in which one part is composed of other parts. A
Department Reporting system typically records the departmental structure of an
organization indicating which departments report to which other departments.
A very effective way to denormalize a hierarchy is to create what are called "speed"
tables. Figure 3 depicts a department hierarchy for a given organization. The hierarchic
tree is built such that the top most node is the entire corporation and each of the other
nodes represents a department at various levels within the corporation. In our example
department 123456 is the entire corporation. Departments 1234 and 56 report directly to
123456. Departments 12, 3, and 4 report directly to 1234 and indirectly to department
123456. And so on.
The table shown under the tree in Figure 3 is the classic relational implementation of a
hierarchy. There are two department columns, one for the parent and one for the child.
This is an accurately normalized version of this hierarchy containing everything that is
represented in the diagram. The complete hierarchy can be rebuilt with the proper data
retrieval instructions.
Even though the implementation effectively records the entire hierarchy, building a query
to report all of the departments under any other given department can be time consuming
to code and inefficient to process. Figure 4 shows a sample query that will return all of
the departments that report to the corporate node 123456. However, this query can only
be built if you know in advance the total number of possible levels the hierarchy can
achieve. If there are n levels in the hierarchy then you will need n-1 UNIONs.
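Figure 4 is not reproduced here, but a query of that shape can be reconstructed for the three-level department example in the text (SQLite syntax for illustration; three levels require two UNIONs):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Classic normalised hierarchy: one row per parent/child pair.
CREATE TABLE dept_hier (parent_dept INTEGER, child_dept INTEGER);
INSERT INTO dept_hier VALUES
  (123456, 1234), (123456, 56), (1234, 12), (1234, 3), (1234, 4);
""")
# All departments under 123456: one SELECT per level of nesting,
# so the query only works if the maximum depth is known in advance.
rows = con.execute("""
SELECT child_dept FROM dept_hier WHERE parent_dept = 123456
UNION
SELECT h2.child_dept
FROM   dept_hier h1 JOIN dept_hier h2 ON h2.parent_dept = h1.child_dept
WHERE  h1.parent_dept = 123456
""").fetchall()
descendants = sorted(r[0] for r in rows)
```

Adding a fourth level to the hierarchy would silently make this query incomplete until a third branch is added, which is precisely the maintenance problem the "speed" table avoids.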
A "speed" table can be built such as the one in Figure 5. This table depicts the parent
department and all of the departments under it regardless of the level. Contrast this to the
previous table which only recorded immediate children for each parent. A "speed" table
also commonly contains other pertinent information that is needed by the given
application. Typical information includes the level within the hierarchy for the given
node, whether or not the given node of the hierarchy is a detail node (at the bottom of the
tree), and, if ordering within level is important, the sequence of the nodes at the given
level.
After the "speed" table has been built, speedy queries can be written against this
implementation of a hierarchy. Figure 6 shows various informative queries that would
have been very inefficient to execute against the classical relational hierarchy. These
queries work for any number of levels between the top and bottom of the hierarchy.
A "speed" table is commonly built using a program written in COBOL or another high
level language. SQL alone is usually either too inefficient to handle the creation of a
"speed" table or impractical because the number of levels in the hierarchy is either
unknown or constantly changing.
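The build step that the text assigns to a COBOL program can be sketched in Python, with SQLite standing in for the target database and illustrative table names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dept_hier (parent_dept INTEGER, child_dept INTEGER);
INSERT INTO dept_hier VALUES
  (123456, 1234), (123456, 56), (1234, 12), (1234, 3), (1234, 4);
-- "Speed" table: every ancestor/descendant pair, with the level recorded.
CREATE TABLE dept_speed (parent_dept INTEGER, child_dept INTEGER,
                         lvl INTEGER);
""")

# Walk the hierarchy once and emit all descendants of every parent,
# not just the immediate children.
edges = con.execute("SELECT parent_dept, child_dept FROM dept_hier").fetchall()
children = {}
for p, c in edges:
    children.setdefault(p, []).append(c)

def descend(root, node, level, out):
    for child in children.get(node, []):
        out.append((root, child, level))
        descend(root, child, level + 1, out)

rows = []
for parent in set(children):
    descend(parent, parent, 1, rows)
con.executemany("INSERT INTO dept_speed VALUES (?,?,?)", rows)

# Any-depth query is now a single scan, whatever the number of levels.
under = sorted(r[0] for r in con.execute(
    "SELECT child_dept FROM dept_speed WHERE parent_dept = 123456"))
```

The recursive walk is what SQL alone handles poorly when the depth is unknown, which is why the text recommends building the table with a program.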
Figure 6. Querying the Speed Table
Types of Denormalization
We have discussed nine different types of denormalization. The table below summarizes
the types of denormalization that are available, with a short description of when each
type is useful.
Summary
The decision to denormalize should never be made lightly because it involves a lot of
administrative dedication. This dedication takes the form of documenting the
denormalization decisions, ensuring valid data, scheduling of data migration, and keeping
end users informed about the state of the tables. In addition, there is one more category of
administrative overhead: periodic analysis.
Whenever denormalized data exists for an application the data and environment should be
periodically reviewed on an on-going basis. Hardware, software, and application
requirements will evolve and change. This may alter the need for denormalization. To
verify whether or not denormalization is still a valid decision ask the following questions:
Have the processing requirements changed for the application such that the join criteria,
timing of reports, and/or transaction throughput no longer require denormalized data?
Did a new DBMS release change performance considerations? For example, did the
introduction of a new join method undo the need for pre-joined tables?
- I/O saved
- CPU saved
- complexity of update programming
- cost of returning to a normalized design
It is important to remember that denormalization was initially implemented for
performance reasons. If the environment changes it is only reasonable to re-evaluate the
denormalization decision. Also, it is possible that, given a changing hardware and
software environment, denormalized tables may be causing performance degradation
instead of performance gains.
Simply stated, always monitor and periodically re-evaluate all denormalized applications.
References
This attachment describes physical database design tips when interfacing with the
Teradata CRM product.