
Physical Database Design Tips

Based on the FS-LDM

Version 2.0
December 2007

This document, and information herein, are the exclusive property of Teradata Corporation and all unauthorized
use and reproduction is prohibited. Copyright (C) 2008 by Teradata Corporation, Dayton, Ohio, USA.
All rights reserved. Printed in Denmark. Confidential unpublished property of Teradata Corporation
Document Changes
Rev. Date Section Comment
1.0 Jan 2003 All Initial Issue – deployed with TSM 4.1
2.0 Dec 2007 All Teradata Branding

Trademarks
All trademarks and service marks mentioned in this document are marks of their respective
owners and are as such acknowledged by Teradata Corporation.

Control Information
Page 58 is the last page of this document. This document is under Revision Control.

Contents

1. Introduction
   1.1 Design considerations
   1.2 Scope

Appendix A. Design of the Physical Database
   A.1 Database Architecture
   A.2 Naming convention
       A.2.1 Explanation of names in hierarchy
   A.3 Scripts for setup
       A.3.1 Check original database space
       A.3.2 Create user EDWadmin
       A.3.3 Create static DB hierarchy
       A.3.4 Create EDW databases
       A.3.5 Create Profiles and Users
       A.3.6 Grant access rights
   A.4 Recommendations

Appendix B. Sizing
   B.1 Erwin volumetric

Appendix C. Physical Database Design
   C.1 PDM Fundamentals

Appendix D. CRM Denormalization Guidelines
   D.1 Primary Index Selection for the Transaction Table
   D.2 Primary Index on Lowest Level of Customer Hierarchy
   D.3 Composite Primary Index on Date, Product, and Location
   D.4 Primary Index on Transaction_id
   D.5 Join Indexing
   D.6 Junction Table Implementation
   D.7 Table Duplication
   D.8 Anonymous Customer (Account) Considerations
       D.8.1 Split Transaction Table
       D.8.2 Constructed Customer (Account) Key for Anonymous Transactions
       D.8.3 Avoid Primary Index on account_id
       D.8.4 Duplication of Identified Transactions

Appendix E. Summary of Primary Indexing Recommendations
   E.1 Primary Index Performance Tradeoffs when all Transactions are Identified to an Account_id
   E.2 Primary Index Performance Tradeoffs when some Transactions are Anonymous
   E.3 Many-to-many relationships in CRM 4.0
       E.3.1 An example of a many-to-many relationship
       E.3.2 Solutions available to version 4.0
   E.4 Multiple Column Keys and CRM 4.0
       E.4.1 An example of a MCK
       E.4.2 What do we do in the current release?
       E.4.3 Solutions Available to Version 4.0

Appendix F. V2R5 Nuggets
   F.1 Materialized view
   F.2 Partitioned Primary Index
   F.3 Identity Columns
   F.4 Value list compression

Appendix G. More on denormalization

1. Introduction
This document gives tips on the transformation of the FS-LDM into a corresponding
physical database design (PDD).
It consists of a number of documents harvested from actual implementations and from
white papers/presentations from a number of sources.
The documents are all placed in appendices and referred to from an overview table in 1.1
Design considerations.
Note: Most of the information in this document is based on implementations performed
under Teradata release V2R4. V2R5 introduces a number of new features that can be used
to achieve the same things described here, e.g. the use of roles instead of granting rights to
each user separately (see the sketch below). The design tips described in this document
that apply to V2R4 can still be used under V2R5.
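
As a hedged illustration of the V2R5 role feature mentioned above, the sketch below
creates a role carrying the end-user access rights and assigns it to a user. The database
names reuse the hierarchy defined in Appendix A; the end-user name UE_Report1 is
hypothetical.

-- Hedged sketch of V2R5 roles; UE_Report1 is a hypothetical end user.
create role EndUser_Role;
grant select on DP_VEW to EndUser_Role;
grant select on DP_MDB to EndUser_Role;
grant EndUser_Role to UE_Report1;
-- The user can then activate it in a session with: set role EndUser_Role;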

1.1 Design considerations


The physical database design for an enterprise data warehouse is focused on providing
flexibility and ease of maintenance for “ad hoc” queries. It is implemented in 3rd normal
form. Major data warehouse activities are data transformation and reporting/analysis.
Design of an enterprise data warehouse is led by the logical data model (LDM).
The physical database design for an application solution is focused on providing
performance for known or “canned” queries - those queries that are often and routinely
executed by the application. Major application solution activities for CRM are
calculations and processing/analysis. Design of an application solution is led by the
PDM.
Teradata CRM will run with a 3NF model, but it will not run optimally. Teradata
provides a generic PDM, as well as physical design guidelines, that will help you design
the optimal database structure for your client.
The considerations are illustrated in Figure 1 Design Considerations.

1.2 Scope
As we are harvesting best practices from different sources, we found it difficult to produce
a document with a natural flow that reads easily from cover to cover. Instead we have
created a very small main document and moved the harvested material to separate
appendices. We believe a PDD document is used for reference rather than cover-to-cover
reading, so the small main body serves as a reading guide. We have established a list of
the main TSM tasks related to PDD and, for each of those, indicated which appendices
cover the subjects of the task. The small main body thus provides a frame of reference for
the appendices.

The appendices are those assets we have harvested that we believe are of most value and
that provide PDD related information over and above what is available from standard
Teradata manuals and courseware.
Appendix D. CRM Denormalization Guidelines contains a white paper by Brobst et al
and discusses CRM. We realize that the FS-LDM Implementation Guide is not about
CRM. However, all of the white paper guidelines are relevant for any "dimensional"
denormalization or 3NF tuning, and thus for any FS-LDM implementation that will make
use of OLAP tools, which is virtually all implementations.

[Figure content: a two-column comparison of the design targets.
Enterprise Data Warehouse – Flexibility, Ease of Maintenance; 3rd Normal Form; “Ad hoc” Query Centric; Data Transformation; Reporting; Logical Data Model (LDM) Lead.
Application Solution – Performance, Optimization; More Denormalized; Star Schema, Snowflake Schema; “Canned” Query Centric; Calculation / Process Centric; Physical Data Model (PDM) Lead.]

Figure 1 Design Considerations


In most data warehouse implementations there is a requirement for both flexibility and
performance. This can be achieved as illustrated in Figure 2 Database/datamart
organization and access path. Careful analysis should be done to determine whether the
tables in the data mart should be realized as physical tables or as views. A general rule of
thumb is that big tables like events should be realized as views, while small aggregate
tables should either be made physical via the R&P process [1] or be views that join the
source tables. A join index can be very useful in these instances.
A number of new features have been introduced in V2R5. The field experience with these
features is still rather limited, so this document gives just a few hints on what these
features are and how they can be useful in future implementations. Please refer to
Appendix F. for more information.

[1] R&P: the process of replicating and propagating data to a dependent data mart.

[Figure content: data flows from the source systems and manual data through ETL into the EDW, and from the EDW via R&P into the data mart; ad hoc reports, OLAP, and standard reports/queries access the EDW and the data mart.]

Figure 2 Database/datamart organization and access path


The following table contains a subset of the task descriptions from TSM 4.0 and, for each
task, reference(s) to the appendices describing the subjects it covers.

Task: Determine the design of databases and users (Teradata database objects) and their
hierarchies. Determine the databases used for tables, views, and macros. Determine how
users' access rights to views and tables will be managed.
Reference: Appendix A. Design of the Physical Database

Task: Design the DBA functions for changing database objects, users, granting access
rights, monitoring performance, monitoring table space, monitoring spool space, and
monitoring security. Determine how privacy requirements will be met.
Reference: Appendix A. Design of the Physical Database

Task: Determine the physical tables that will be implemented so that performance and
availability requirements are met for applications and ECTL. Document deviations from
the LDM and the reasons for a different physical design.
Reference: Appendix C. Physical Database Design and Appendix D. CRM Denormalization Guidelines

Task: Determine the estimated data volume per table. Determine the amount of data that
will be inserted, updated, or deleted during ECTL runs.
Reference: Appendix B. Sizing

Task: Identify candidate summary tables. Note any items that are considered out of scope
for this project that may be addressable in future projects or iterations of the data
warehouse.
Reference: Appendix D. CRM Denormalization Guidelines

Appendix A. Design of the
Physical Database
The physical database design is one of the initial steps in the Design stage of a Data
Warehouse implementation project according to TSM 4.0 (DSA.10 and DCC.10).
The Teradata Design Methodology describes, among other things, the process required for
creating a physical database design. This is a process in which a logical data model is
transformed into a physical data model by applying a set of proven transformation rules,
while also taking into account business and technical policies, opportunities, and constraints.
In the examples below it is assumed that the development, test, and production environments
are implemented on the same system, a situation arising very often with new Teradata
customers.

A.1 Database Architecture


The recommended system hierarchy concept is shown below. It shows an example of
how to define system administrator(s) and users with different access rights. The
illustration only shows the production environment. The test and development
environments should be implemented with exactly the same structure, whether they are
installed on the same machine or not. Users who perform backups will be added within the
same framework.

Figure 3 Teradata System Hierarchy Concepts
Another representation of the hierarchy is shown in Figure 4.
In this example the hierarchy is subdivided into 3 similar areas in order to separate
development, test, and production; only the production hierarchy is fully expanded in the
figure.

Figure 4 Overview of the EDW Database Setup


The usage of the various databases is as follows:

Database    Usage
DP_LOG      Log tables, duplicate tables, and error tables created by Teradata tools during data load operations
DP_MDB      Tables and views for the dependent MSTR data mart
DP_STG      Load-ready tables and intermediate help tables
DP_STG_*    Extracts staged with FastLoad (FL), one database per source system
DP_TAB      The EDW tables
DP_UTL      Utilities for EDW (macros etc.)
DP_VEW      Views for accessing the EDW tables
DP_WRK      MultiLoad work tables (end-user access)
DP_DUP      Tables to hold duplicate records of DP_TAB tables

Table 1 Database Usage


The user databases are used as follows:

Database         Usage
EDWadmin         The top super user for administration of the EDW hierarchy
PP_Admins        Administrative users responsible for administering the database objects and users
PA_Admins        Administrative users responsible for the administration of the Teradata database
PP_BackupUsers   Users for running the BAR processes
PP_LoadUsers     Users responsible for staging, transforming, and loading data
PP_EndUsers      Application users
PP_PowerUsers    Users responsible for developing applications and creating views for the users to access the data
PP_MDB_Admins    Administrator users responsible for the Data Mart (DP_MDB)
PP_RepUsers      Users responsible for running the Replication and Propagation processes to update the Data Mart (DP_MDB)

Table 2 User Database Usage

A.2 Naming convention


The convention used for naming databases and users within the hierarchy is as follows:
EDWadmin is the name of the super-user.
Names like D_* (e.g. D_Databases or D_ProdStaging) are databases used as placeholders
for underlying databases or users.
Names like DP_*, DD_* and DT_* are actual databases containing database objects; e.g.
DT_TAB means tables in the test database.
Names like PP_*, PD_* and PT_* are databases corresponding to user profiles. These
profiles implement the access rights to the database objects for all the users under them.
For example, PD_LoadUsers is a profile that implements access rights for the users that
run load jobs (write access to staging areas, log areas etc.); if the user UD_Load_MD
belongs to this profile, it inherits those access rights.
Names like UP_*, UD_* and UT_* are actual users that can log on to Teradata and will
have access to objects in their corresponding environment. For example, UD_Fox is a user
that has access to the development environment. UE_* denotes end users, e.g. reporting
tool users or ad hoc users.

A.2.1 Explanation of names in hierarchy


The names connected to the Production hierarchy are explained below; the names in the
Development and Test hierarchies have similar meanings. Extensions are possible and
some likely ones are also listed.
DP_TAB Contains the tables for the EDW.
DP_UTL Utilities for the EDW (macros etc.).
DP_WRK Work/temp tables for reporting tools and power users.
DP_VEW Contains the views through which the end users access the data. End users will
only have access to the database tables via these views, since this provides better
flexibility; to satisfy the reporting tools we can also implement views based on any kind of
select statement. Views can be implemented using the “locking table for access” modifier
on the select statement, which may prove beneficial in the environment (see the sketch
after this list).
DP_STG Staging area for cross-source-system staging tables, e.g. result sets from joins
between data in the different source systems.
DP_STG_* Staging areas for the source systems (AR, BC, CS, IP, MD).
DP_LOG Log tables holding the results of the load runs.
UP_Admin Administrator user for the production environment only.
UP_Batch This (example) user runs batch. There can be several “batch” users.
UP_Load This (example) user runs loads. There could be several “load” users, e.g.
UP_Load_AR.
UP_MDB_Admin Administrator user for the Data Mart (DP_MDB) in the production
environment only.
UP_RepUsers This (example) user runs the Replication (and Propagation) processes to
update the Data Mart (DP_MDB).
Possible extensions can be:
DP_HST Space where a history of updates can be stored.
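
As a hedged illustration of such an access view, the sketch below defines a view in
DP_VEW over a table in DP_TAB using the LOCKING ... FOR ACCESS modifier. The
table and column names are illustrative only and are not taken from the actual FS-LDM
DDL.

/* Hedged sketch only: illustrative table and column names. */
replace view DP_VEW.Party
as locking table DP_TAB.Party for access
select Party_Id
     , Party_Type_Cd
     , Party_Start_Dt
from DP_TAB.Party;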
A.3 Scripts for setup
This section contains the scripts for setting up the initial database hierarchy, and they
should at all times reflect the actual environment. However, certain objects, like actual
user names (e.g. UT_testload1) belonging to a profile (e.g. PT_LoadUsers), may come
and go over time.
The users that belong to an administrator profile, e.g. user UP_Admin within
PP_Admins, are able to administer their corresponding environment. For example,
UP_Admin can administer the D_Production environment and the D_ProdUsers
environment. UP_Admin can also create new profiles PD_* and create and assign
new users to the profiles.
The space allocations in the scripts will need modification according to requirements
determined in the future. Similarly, the access rights will need some adjustment to reflect
actual future needs. These adjustments are conveniently done via WinDDI.

A.3.1 Check original database space


/********************************************************
* Check original database space *
********************************************************/
select databasename, sum(maxperm), sum(maxspool) from dbc.diskspace
group by 1 order by 1;

select databasename, ownername, creatorname
from dbc.databases
order by 2, 1;

A.3.2 Create user EDWadmin


-- Create the EDWadmin, must be run as DBC
-- Make sure that DBC can GRANT
grant all on dbc to dbc with grant option;

create user EDWadmin from DBC as password = EDWadmin, perm = 5000000000,
    spool = 200000000000, no fallback, collation = host;
grant all on EDWadmin to EDWadmin with grant option;

grant select on dbc to EDWadmin with grant option;
grant monitor privileges to EDWadmin with grant option;

grant delete on dbc.acctg to EDWadmin with grant option ;


grant delete on dbc.eventlog to EDWadmin with grant option ;
grant delete on dbc.acclogtbl to EDWadmin with grant option ;
grant delete on dbc.resusageicpu to EDWadmin with grant option ;
grant delete on dbc.resusageipma to EDWadmin with grant option ;
grant delete on dbc.resusageivpr to EDWadmin with grant option ;
grant delete on dbc.resusagescpu to EDWadmin with grant option ;
grant delete on dbc.resusagesctl to EDWadmin with grant option ;
grant delete on dbc.resusageshst to EDWadmin with grant option ;
grant delete on dbc.resusagesldv to EDWadmin with grant option ;
grant delete on dbc.resusagesobj to EDWadmin with grant option ;
grant delete on dbc.resusagespma to EDWadmin with grant option ;
grant delete on dbc.resusagesvpr to EDWadmin with grant option ;
grant delete on dbc.resusagesvpr2 to EDWadmin with grant option ;
-- grant for logon and logging
grant execute on dbc.logonrule to EDWadmin with grant option;
-- table does not exist: grant execute on dbc.acclogrule to EDWadmin with grant option;
-- the user does not have insert with grant option access to dbc.syssecdefaults:
-- grant select, insert, update, delete on dbc.syssecdefaults to EDWadmin with grant option;

A.3.3 Create static DB hierarchy


/***********************************************************
* Create static database hierarchy, run as EDWadmin *
***********************************************************/
create database D_SpoolReserve from EDWadmin as perm = 1000000000;
create database D_Databases from EDWadmin as perm = 0;
create database D_Production from D_Databases as perm = 0;
create database D_Development from D_Databases as perm = 0;
create database D_Test from D_Databases as perm = 0;
create database D_Users from EDWadmin as perm = 100000000;       /* 100MB */

create database D_ProdStaging from D_Production as perm = 0;
create database D_DevStaging from D_Development as perm = 0;
create database D_TestStaging from D_Test as perm = 0;

create database D_ProdUsers from D_Users as perm = 0;
create database D_DevUsers from D_Users as perm = 100000000;     /* 100MB */
create database D_TestUsers from D_Users as perm = 0;
create database D_Administrators from D_Users as perm = 0;

A.3.4 Create EDW databases

/********************************************************
* Create EDW databases, run as EDWadmin *
********************************************************/
create database DP_TAB from D_Production as perm = 0;
create database DP_UTL from D_Production as perm = 0;
create database DP_WRK from D_Production as perm = 0;
create database DP_VEW from D_Production as perm = 0;
create database DP_LOG from D_Production as perm = 0;
create database DP_MDB from D_Production as perm = 0;
create database DP_STG from D_ProdStaging as perm = 0;
create database DP_STG_AR from D_ProdStaging as perm = 0;
create database DP_STG_CS from D_ProdStaging as perm = 0;
create database DP_STG_BC from D_ProdStaging as perm = 0;
create database DP_STG_IP from D_ProdStaging as perm = 0;
create database DP_STG_MD from D_ProdStaging as perm = 0;

create database DD_TAB from D_Development as perm = 0;


create database DD_UTL from D_Development as perm = 0;
create database DD_WRK from D_Development as perm = 0;
create database DD_VEW from D_Development as perm = 0;
create database DD_LOG from D_Development as perm = 0;
create database DD_MDB from D_Development as perm = 0;

create database DD_STG from D_DevStaging as perm = 0;


create database DD_STG_AR from D_DevStaging as perm = 0;
create database DD_STG_CS from D_DevStaging as perm = 0;
create database DD_STG_BC from D_DevStaging as perm = 0;
create database DD_STG_IP from D_DevStaging as perm = 0;
create database DD_STG_MD from D_DevStaging as perm = 0;

create database DT_TAB from D_Test as perm = 0;


create database DT_UTL from D_Test as perm = 0;
create database DT_WRK from D_Test as perm = 0;
create database DT_VEW from D_Test as perm = 0;
create database DT_LOG from D_Test as perm = 0;
create database DT_MDB from D_Test as perm = 0;

create database DT_STG from D_TestStaging as perm = 0;


create database DT_STG_AR from D_TestStaging as perm = 0;
create database DT_STG_CS from D_TestStaging as perm = 0;
create database DT_STG_BC from D_TestStaging as perm = 0;
create database DT_STG_IP from D_TestStaging as perm = 0;
create database DT_STG_MD from D_TestStaging as perm = 0;

A.3.5 Create Profiles and Users


/***********************************************************
* Create Profiles, run as EDWadmin *
***********************************************************/
create database PP_Admins from D_Administrators as perm = 0;
create database PD_Admins from D_Administrators as perm = 0;
create database PT_Admins from D_Administrators as perm = 0;

create database PP_MDB_Admins from D_ProdUsers as perm = 0;


create database PP_RepUsers from D_ProdUsers as perm = 0;

create database PD_MDB_Admins from D_DevUsers as perm = 0;


create database PD_RepUsers from D_DevUsers as perm = 0;

create database PT_MDB_Admins from D_TestUsers as perm = 0;


create database PT_RepUsers from D_TestUsers as perm = 0;

/* Production rights for profile PP_Admins */


grant all privileges on D_Production to all PP_Admins with grant option;
grant all privileges on DP_TAB to all PP_Admins with grant option;
grant all privileges on DP_UTL to all PP_Admins with grant option;
grant all privileges on DP_WRK to all PP_Admins with grant option;
grant all privileges on DP_VEW to all PP_Admins with grant option;
grant all privileges on DP_LOG to all PP_Admins with grant option;
grant all privileges on DP_MDB to all PP_Admins with grant option;
grant all privileges on D_ProdStaging to all PP_Admins with grant option;

grant all privileges on DP_STG to all PP_Admins with grant option;
grant all privileges on DP_STG_AR to all PP_Admins with grant option;
grant all privileges on DP_STG_CS to all PP_Admins with grant option;
grant all privileges on DP_STG_BC to all PP_Admins with grant option;
grant all privileges on DP_STG_IP to all PP_Admins with grant option;
grant all privileges on DP_STG_MD to all PP_Admins with grant option;

grant all privileges on D_ProdUsers to all PP_Admins with grant option;


grant all privileges on PP_BatchUsers to all PP_Admins with grant option;
grant all privileges on PP_EndUsers to all PP_Admins with grant option;
grant all privileges on PP_LoadUsers to all PP_Admins with grant option;
grant all privileges on PP_PowerUsers to all PP_Admins with grant option;
grant all privileges on PP_MDB_Admins to all PP_Admins with grant option;
grant all privileges on PP_RepUsers to all PP_Admins with grant option;

/* Development rights for profile PD_Admins */


grant all privileges on D_Development to all PD_Admins with grant option;
grant all privileges on DD_TAB to all PD_Admins with grant option;
grant all privileges on DD_UTL to all PD_Admins with grant option;
grant all privileges on DD_WRK to all PD_Admins with grant option;
grant all privileges on DD_VEW to all PD_Admins with grant option;
grant all privileges on DD_LOG to all PD_Admins with grant option;
grant all privileges on DD_MDB to all PD_Admins with grant option;
grant all privileges on D_DevStaging to all PD_Admins with grant option;
grant all privileges on DD_STG to all PD_Admins with grant option;
grant all privileges on DD_STG_AR to all PD_Admins with grant option;
grant all privileges on DD_STG_CS to all PD_Admins with grant option;
grant all privileges on DD_STG_BC to all PD_Admins with grant option;
grant all privileges on DD_STG_IP to all PD_Admins with grant option;
grant all privileges on DD_STG_MD to all PD_Admins with grant option;

grant all privileges on D_DevUsers to all PD_Admins with grant option;


grant all privileges on PD_BatchUsers to all PD_Admins with grant option;
grant all privileges on PD_EndUsers to all PD_Admins with grant option;
grant all privileges on PD_LoadUsers to all PD_Admins with grant option;
grant all privileges on PD_PowerUsers to all PD_Admins with grant option;
grant all privileges on PD_MDB_Admins to all PD_Admins with grant option;
grant all privileges on PD_RepUsers to all PD_Admins with grant option;

/* Test rights for profile PT_Admins */


grant all privileges on D_Test to all PT_Admins with grant option;
grant all privileges on DT_TAB to all PT_Admins with grant option;
grant all privileges on DT_UTL to all PT_Admins with grant option;
grant all privileges on DT_WRK to all PT_Admins with grant option;
grant all privileges on DT_VEW to all PT_Admins with grant option;
grant all privileges on DT_LOG to all PT_Admins with grant option;
grant all privileges on DT_MDB to all PT_Admins with grant option;
grant all privileges on D_TestStaging to all PT_Admins with grant option;
grant all privileges on DT_STG to all PT_Admins with grant option;
grant all privileges on DT_STG_AR to all PT_Admins with grant option;
grant all privileges on DT_STG_CS to all PT_Admins with grant option;
grant all privileges on DT_STG_BC to all PT_Admins with grant option;
grant all privileges on DT_STG_IP to all PT_Admins with grant option;
grant all privileges on DT_STG_MD to all PT_Admins with grant option;

grant all privileges on D_TestUsers to all PT_Admins with grant option;


grant all privileges on PT_BatchUsers to all PT_Admins with grant option;
grant all privileges on PT_EndUsers to all PT_Admins with grant option;
grant all privileges on PT_LoadUsers to all PT_Admins with grant option;
grant all privileges on PT_PowerUsers to all PT_Admins with grant option;
grant all privileges on PT_MDB_Admins to all PT_Admins with grant option;
grant all privileges on PT_RepUsers to all PT_Admins with grant option;

/* Create the 3 administrators, one for each environment */
create user UP_Admin from PP_Admins as password = UP_Admin, perm = 0,
spool = 200000000000;
create user UD_Admin from PD_Admins as password = UD_Admin, perm = 0,
spool = 200000000000;
create user UT_Admin from PT_Admins as password = UT_Admin, perm = 0,
spool = 200000000000;

create database PP_PowerUsers from D_ProdUsers as perm = 0;


create database PP_BatchUsers from D_ProdUsers as perm = 0;
create database PP_LoadUsers from D_ProdUsers as perm = 0;
create database PP_EndUsers from D_ProdUsers as perm = 0;

create database PD_PowerUsers from D_DevUsers as perm = 10000000;   /* 10MB */
create database PD_BatchUsers from D_DevUsers as perm = 0;
create database PD_LoadUsers from D_DevUsers as perm = 0;
create database PD_EndUsers from D_DevUsers as perm = 0;

create database PT_PowerUsers from D_TestUsers as perm = 0;


create database PT_BatchUsers from D_TestUsers as perm = 0;
create database PT_LoadUsers from D_TestUsers as perm = 0;
create database PT_EndUsers from D_TestUsers as perm = 0;

grant all privileges on DP_MDB to all PP_MDB_Admins with grant option;


grant all privileges on DD_MDB to all PD_MDB_Admins with grant option;
grant all privileges on DT_MDB to all PT_MDB_Admins with grant option;

/* Create the 3 MDB administrators, one for each environment */


create user UP_MDB_Admin from PP_MDB_Admins as password = UP_MDB_Admin,
perm = 0, spool=200000000000;
create user UD_MDB_Admin from PD_MDB_Admins as password = UD_MDB_Admin,
perm = 0, spool=200000000000;
create user UT_MDB_Admin from PT_MDB_Admins as password = UT_MDB_Admin,
perm = 0, spool=200000000000;

A.3.6 Grant access rights


/***********************************************************
* Create Access rights for Production, run as UP_Admin *
***********************************************************/

grant select, execute on DP_TAB to all PP_PowerUsers;


grant select, macro, execute on DP_UTL to all PP_PowerUsers;
grant table, view, macro, insert, select, update, delete, execute
on DP_WRK to all PP_PowerUsers;
grant view, macro, execute, select on DP_VEW to all PP_PowerUsers;
grant view, macro, execute, select on DP_MDB to all PP_PowerUsers;
grant select on DP_LOG to all PP_PowerUsers;

grant insert, select, update, execute on DP_TAB to all PP_LoadUsers;


grant table, view, macro, insert, select, update, delete, execute on DP_UTL to all PP_LoadUsers;
grant macro, insert, select, execute on DP_LOG to all PP_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute on DP_STG to all PP_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute on DP_STG_AR to all PP_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute on DP_STG_CS to all PP_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute on DP_STG_BC to all PP_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute on DP_STG_IP to all PP_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute on DP_STG_MD to all PP_LoadUsers;

grant insert, select, update, delete, execute on DP_MDB to all PP_RepUsers; /* Index might be required */

grant execute, select on DP_TAB to all PP_RepUsers;
grant insert, select, execute on DP_LOG to all PP_RepUsers;

grant select on DP_VEW to all PP_EndUsers;


grant select on DP_MDB to all PP_EndUsers;

grant select, execute on DP_TAB to all PP_BatchUsers;


grant select, macro, execute on DP_UTL to all PP_BatchUsers;
grant table, view, macro, insert, select, update, delete, execute
on DP_WRK to all PP_BatchUsers;
grant select on DP_VEW to all PP_BatchUsers;
grant select on DP_LOG to all PP_BatchUsers;

/***********************************************************
* Create Access rights for Development, run as UD_Admin *
***********************************************************/

grant select, execute on DD_TAB to all PD_PowerUsers;


grant select, macro, execute on DD_UTL to all PD_PowerUsers;
grant table, view, macro, insert, select, update, delete, execute
on DD_WRK to all PD_PowerUsers;
grant view, macro, execute, select on DD_VEW to all PD_PowerUsers;
grant view, macro, execute, select on DD_MDB to all PD_PowerUsers;
grant select on DD_LOG to all PD_PowerUsers;

grant insert, select, update, execute


on DD_TAB to all PD_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DD_UTL to all PD_LoadUsers;
grant macro, insert, select, execute
on DD_LOG to all PD_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DD_STG to all PD_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DD_STG_AR to all PD_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DD_STG_CS to all PD_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DD_STG_BC to all PD_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DD_STG_IP to all PD_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DD_STG_MD to all PD_LoadUsers;

grant insert, select, update, delete, execute on DD_MDB to all PD_RepUsers; /* Index might be required */
grant execute, select on DD_TAB to all PD_RepUsers;
grant insert, select, execute on DD_LOG to all PD_RepUsers;

grant select on DD_VEW to all PD_EndUsers;


grant select on DD_MDB to all PD_EndUsers;

grant select, execute on DD_TAB to all PD_BatchUsers;


grant select, macro, execute on DD_UTL to all PD_BatchUsers;
grant table, view, macro, insert, select, update, delete, execute
on DD_WRK to all PD_BatchUsers;
grant select on DD_VEW to all PD_BatchUsers;
grant select on DD_LOG to all PD_BatchUsers;

/***********************************************************
* Create Access rights for Test, run as UT_Admin *
***********************************************************/

grant select, execute on DT_TAB to all PT_PowerUsers;


grant select, macro, execute on DT_UTL to all PT_PowerUsers;
grant table, view, macro, insert, select, update, delete, execute
on DT_WRK to all PT_PowerUsers;
grant view, macro, execute, select on DT_VEW to all PT_PowerUsers;
grant view, macro, execute, select on DT_MDB to all PT_PowerUsers;
grant select on DT_LOG to all PT_PowerUsers;

grant insert, select, update, execute


on DT_TAB to all PT_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DT_UTL to all PT_LoadUsers;
grant macro, insert, select, execute
on DT_LOG to all PT_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DT_STG to all PT_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DT_STG_AR to all PT_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DT_STG_CS to all PT_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DT_STG_BC to all PT_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DT_STG_IP to all PT_LoadUsers;
grant table, view, macro, insert, select, update, delete, execute
on DT_STG_MD to all PT_LoadUsers;

grant insert, select, update, delete, execute on DT_MDB to all PT_RepUsers; /* Index might be required */
grant execute, select on DT_TAB to all PT_RepUsers;
grant insert, select, execute on DT_LOG to all PT_RepUsers;

grant select on DT_VEW to all PT_EndUsers;


grant select on DT_MDB to all PT_EndUsers;

grant select, execute on DT_TAB to all PT_BatchUsers;


grant select, macro, execute on DT_UTL to all PT_BatchUsers;
grant table, view, macro, insert, select, update, delete, execute
on DT_WRK to all PT_BatchUsers;
grant select on DT_VEW to all PT_BatchUsers;
grant select on DT_LOG to all PT_BatchUsers;

A.4 Recommendations
The DBC and EDWadmin passwords have to be protected to avoid any unauthorized access
to the database. These passwords should be secured in an envelope, with a defined
mechanism for accessing them only in a crisis.
It is recommended that EDWadmin and UP_Admin revoke their own Drop Table
privileges on the Dx_TAB databases to avoid accidental drops of objects. When required to
drop an object, they can grant the Drop Table privilege to themselves, complete the drop
action cautiously, and revoke the privilege again, as sketched below.
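
A hedged sketch of that revoke/re-grant cycle, assuming the EDWadmin user and DP_TAB
database from the scripts above; the table being dropped is hypothetical.

/* Hedged sketch; Obsolete_Table is a hypothetical object. */
revoke drop table on DP_TAB from EDWadmin;
-- ...later, when a drop really is required:
grant drop table on DP_TAB to EDWadmin;
drop table DP_TAB.Obsolete_Table;
revoke drop table on DP_TAB from EDWadmin;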

Appendix B. Sizing
The table below shows an example of how to collect sizing information and project the
growth over 5 years. This is one way of doing it and is very detailed. It might be
sufficient to collect size information on the major tables and estimate the growth based on
those tables, with perhaps a 10% markup for the minor tables.
Furthermore, it is advisable to run sizing statistics on a regular basis to project the growth
over the next year. This forms input to the decision on either ordering more hardware or
starting to purge old data; a hedged query sketch for collecting current sizes follows below.
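
As a hedged starting point for such regular size tracking, the query below reads current
permanent space per table from the data dictionary; the choice of DP_TAB as the database
to measure simply follows the hierarchy in Appendix A.

/* Hedged sketch: current size per EDW table, from the data dictionary. */
select databasename, tablename, sum(currentperm) as current_perm_bytes
from dbc.tablesize
where databasename = 'DP_TAB'
group by 1, 2
order by 3 desc;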

B.1 Erwin volumetric

You can use ERwin volumetrics to accurately calculate the size of tables, indexes, and
physical storage objects in your database. When you calculate database size and growth,
you can:

• Forecast hardware requirements.
• Evaluate the implications of database growth.
• Create "what if..." scenarios based on server, physical object, column, and table settings.
In ERwin, you can select any table and calculate its approximate size according to the
initial state or projected growth. After you calculate all table sizes in your database, you
can calculate the approximate size of the entire database. When estimating the size of a
database table, ERwin considers the datatypes that are native to the DBMS you are using.
Using ERwin volumetrics, you can:

• Manipulate server-specific column values, such as NULL and variable-width columns, that influence table size calculations.
• Include index files in database calculations and select appropriate physical storage objects for each individual table in the database.
• Modify various parameters that affect database size calculations. For example, you can change the number of bytes per character, adjust the amount of space overhead per row, and include a log space factor to account for database log space.
• Use the ERwin Data Browser to print volumetrics reports by Physical Object, Database, and Table with appropriate totals.
• Modify table calculation settings in the Data Browser, and have the Data Browser calculate database size in real time.
The start-up screen is shown in Figure 5 ERwin volumetric. This way of entering
volumetrics is also rather detailed, but by applying the same rule as described above for
the Excel sheet, the task can be reduced considerably.


Figure 5 ERwin volumetric


Practitioners are encouraged to exploit the possibilities of ERwin volumetrics.

Appendix C. Physical Database
Design

This section provides tips on how to create an optimized database structure in order to get a
high-performance data warehouse. Note, however, that the final optimizations cannot be
done before more detailed knowledge of the data involved is available. The optimizations
require data demographics information as well as knowledge of the usage pattern.
Techniques for optimizing an application fall into two categories. One deals with the
physical data model (PDM), and the other deals with queries and the use of SQL features
such as temporary tables, derived tables, the CASE feature, etc. The second category is not
described in this document.

C.1 PDM Fundamentals


The fundamentals for a good physical data model are:
Minimize unnecessary movement of data
Choose a primary index that minimizes unnecessary data movement via row re-distributions.
Row re-distributions occur for joins and aggregations to get the same column values
located on the same AMP. Minimizing row re-distributions often means defining a non-
unique primary index instead of a unique primary index. More often than not, the primary
key in the logical data model is used as a unique primary index. The chief advantage of
this choice is that one gets a relatively even distribution of data across all AMPs, which
means one can take full advantage of the parallelism that Teradata offers.
Provide good navigation through the data tables
However, the real goal of the physical data model is not just to provide an even distribution
of data, but to provide the best way to navigate through the data tables. For example, in
many applications, there are several examples of a join table that relates two fact tables.
In many of these tables the PI consists of three or more columns. (e.g. PARTY CARD
table where the PI is Party Id, Card Id, Party Card Role Cd and Party Card Rel Start Dt).
Such an index makes sense in an OLTP application where all column values can be
specified with specific values. In a data warehouse application, the date column is used
for qualifying a row and not for navigation.
As for the other two columns, one can start with the Party Id and obtain all the Card Id’s
associated with the Party Id, or one can start with the Card Id and find its associated Party
Id’s. If one always had both values available, there would be no need for the join table.
In general, for a warehouse, one starts with the highest level in a hierarchy and works
down to get all rows grouped together at the higher level. For example, one would more
likely ask a question about a given party and its associated data than about a card and its
associated data. Consequently, the primary index should be a single column (Party Id).
Avoid unnecessary overhead
Choosing a non-unique primary index can present some negative side effects that one
must be careful to avoid. When there are too many instances of a non-unique primary
index value, the overhead for duplicate row checks can cause inserts to run more than 10
times longer.
This problem can be avoided by either defining a unique secondary index, or declaring
the table as MULTISET instead of SET.
A unique secondary index costs disk storage space and processing overhead to maintain
the index at insert time. A multiset table means one must build in safeguards to avoid
inserting a duplicate row.
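
A hedged sketch of the MULTISET alternative, reusing the PARTY CARD example from
above with a single-column NUPI on Party Id; the data types are illustrative and not taken
from the actual FS-LDM DDL.

/* Hedged sketch: illustrative data types; NUPI on Party_Id, MULTISET to
   avoid duplicate-row checks at insert time. */
create multiset table DP_TAB.Party_Card
( Party_Id                integer not null
, Card_Id                 integer not null
, Party_Card_Role_Cd      char(2) not null
, Party_Card_Rel_Start_Dt date    not null
)
primary index (Party_Id);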
Use compression to reduce database size
Compress null and common values with the SQL COMPRESS option. The value of
the SQL COMPRESS option is that a column value is replaced by a single bit whenever
the column value matches the compressed default. In particular, there are many columns
defining end dates (xxxx_end_dt) that usually are set to NULL or to the maximum date
that can be specified in the system. Since active parties, accounts, and access methods will
have most of their end dates falling into this category, the end-date columns are prime
candidates for compression. Instead of reserving 8 bytes for a date column for a million
rows, only one bit in a byte will be used. Thus, the presence bits for up to 8 compressed
columns can be represented by a single byte. The more columns in a table that are
compressed, the greater the savings.
Mutually exclusive columns in the same table should also be compressed. There is no
need to have physical space reserved in the row when no data occurs in a mutually
exclusive column. Another typical case is where the most common value of a column
makes up 25% or more of all the rows in the table.
Data compression results in smaller tables in terms of bytes stored on disk. This
means fewer disk I/Os to read the table; a hedged DDL sketch follows below.
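
As a hedged illustration of single-value compression on an end-date column and a code
column, the sketch below uses an invented table in DP_TAB; the column names,
compressed values, and primary index are illustrative only (compressing a list of values
per column is a V2R5 feature, covered in Appendix F).

/* Hedged sketch: illustrative table, columns, and compressed values. */
create table DP_TAB.Party_Acct_Demo
( Party_Id        integer not null
, Acct_Id         integer not null
, Acct_End_Dt     date compress (date '9999-12-31')   -- dominant "open" end date; NULLs are compressed as well
, Acct_Status_Cd  char(2) compress ('AC')             -- single most common status code
)
primary index (Party_Id);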
Add NUSIs for equality conditions. Depending on the query set, secondary indexes can
result in reading a smaller subset of the table data. Which columns to define as secondary
indexes depends on the application; no specific recommendation can be made without
more information.
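
Purely as a hedged illustration of defining such a NUSI, the statement below indexes the
status-code column of the invented demo table from the previous sketch.

/* Hedged sketch: NUSI on an illustrative equality-condition column. */
create index (Acct_Status_Cd) on DP_TAB.Party_Acct_Demo;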
Collect Statistics
Collect statistics on the primary index of small tables, and on join and qualification
columns.

Collected statistics aid the Optimizer in choosing the best join plan. Without collected
statistics, the Optimizer may make assumptions that do not hold and consequently choose
a poor join plan; a hedged sketch of typical COLLECT STATISTICS statements follows below.
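
A hedged sketch of such statements, assuming an illustrative small reference table and the
invented demo table from the earlier sketches; none of the table or column names are
taken from the actual FS-LDM DDL.

/* Hedged sketch: illustrative tables and columns. */
collect statistics on DP_TAB.Product_Type index (Product_Type_Cd);     -- PI of a small table (assumes this column is its primary index)
collect statistics on DP_TAB.Party_Acct_Demo column (Party_Id);        -- join column
collect statistics on DP_TAB.Party_Acct_Demo column (Acct_Status_Cd);  -- qualification column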

Appendix D. CRM Denormalization
Guidelines
Prepared by: Stephen Brobst, Carlos Bouloy, Tim Grant, Gonzalo Hidalgo
Date Distributed: July 18, 2001

The purpose of this white paper is to provide a framework for evaluating tradeoffs in
physical database design for deployment of the Teradata Customer Relationship
Management (CRM) Application Suite. In this white paper we examine a variety of
options for primary indexing and table organization corresponding to various real-world
scenarios that have been implemented at client sites. Benchmarking has been undertaken
to quantify performance tradeoffs, along with spool space requirements, for the various
options. This white paper is the result of benchmarking and analysis efforts undertaken
by Carlos Bouloy, Stephen Brobst, Tim Grant, and Gonzalo Hidalgo. The content of this
document represents a synthesis of experience to date with the Teradata CRM Application
Suite. These are guidelines for implementation and must be tempered with the specific
circumstances of each client deployment. We have tried to capture the most commonly
discussed alternatives for physical design. If you have another scenario that should be
included, please let us know.

D.1 Primary Index Selection for the Transaction Table

The primary index for a table defines the column(s) by which the table's rows are assigned to
virtual AMPs (vAMPs) within the parallel database. The general rule of thumb for
primary index selection is to select column(s) that have the following two properties:

• Provide a frequently used access path for joins and aggregates to maximize the occurrence of localized operations in the parallel database
• Provide a large set of unique values with relatively even distribution of rows so as to balance the parallel workload across all available vAMPs
In this section, we will discuss the specific tradeoffs related to primary index selection for
the transaction table in a Teradata CRM implementation. The transaction table represents
detailed customer behaviors such as purchases in a retail outlet, call detail records (CDRs)
for a telecommunications provider, deposits and withdrawals for a financial services
company, and so on. Multiple kinds of transactions may be relevant in some customer
scenarios (inquiry versus purchase, payment versus claim, etc.). In this discussion, we
will consider the tradeoffs involved in six different primary indexing options:

• Primary index on lowest level of customer hierarchy
• Composite primary index on date, product, and location
• Primary index on transaction_id
• Composite primary index on date, product, and location with use of a join index to provide co-location with the lowest level of the customer hierarchy
• Cross-reference table implementation
• Primary index on both lowest level of customer hierarchy and date, product, and location using table duplication
These options will be described in the following subsections along with results from
benchmark testing, where applicable. The last subsection treats special considerations for
handling anonymous transactions.

D.2 Primary Index on Lowest Level of Customer Hierarchy

As discussed previously, all Teradata CRM implementations will use one or more
customer hierarchies. A typical customer hierarchy would include household, party
(individual), and account. In such an example, account would be considered the lowest
level in the customer hierarchy. There may be more or fewer levels in the hierarchy,
depending on the specific implementation. Note that there can also be many-to-many
relationships among levels in the hierarchy in the data model. For example, an individual
can own multiple accounts and an account can have multiple (joint) owners. Teradata
CRM provides the ability for the business to specify (both static and dynamic) rules to
disambiguate many-to-many relationships. See the “Many-to-Many and Composite Key
Considerations” paper on the Teradata CRM Services Portal. Selecting the lowest level
of the customer hierarchy (e.g., account) will generally provide the highest performance
implementation scenario for Teradata CRM implementations. This choice of primary
index allows vAMP local operations and avoids the need for spool space allocation
related to large data distributions in most cases. There are two important issues to be
considered, however, when assessing this performance advantage:

• Primary index volatility
• Data distribution characteristics
The transaction table is typically the largest table in a data warehouse. As a result, it is
relatively important that its primary key be somewhat stable. The Teradata database does
not allow direct updates of the primary index value in a table (a delete followed by an
insert must be initiated). If it is likely that there will be frequent changes to the primary
index value of a table, then careful consideration of the impacts on data maintenance
should be undertaken. In general, the account level in a customer hierarchy is relatively
stable because accounts do not (typically) change their unique identifier over time.
Accounts are opened and closed, but once a transaction is assigned to an account it
usually stays that way.
In contrast, party identifiers are notoriously more volatile. Parties usually refer to people
or businesses. The problem with party_id as a primary index is that it changes over time
due to changes in merge/purge (individualization) rules and as new information about
individuals/businesses is obtained. For example, we may have two records for John
Smith and Jonathan Smith - each with a unique party_id because they are located at
different addresses. Subsequently, we may get a change of address from an old address to
the new address that brings the two records together, or perhaps additional information of
another sort is provided (Social Security Number, telephone number, birthday, etc.) that
allows linking of the two records into one. Depending on the individualization rules used
within the business, the scenarios can be numerous. The result, however, is that only one
of the two party_id numbers can survive after the merge of the two records. If party_id is
used as a column in the transaction table, then all transactions that used the old party_id
to identify their owner will need to be updated with the new (surviving) party_id.
Moreover, any tables related to the history of the merged party record (e.g., direct
mailings) will also need to be updated. There are design techniques to keep track of both
the before and after versions of the transactions when such changes take place, but these
are beyond the scope of this paper. Note that inserting a cross-reference table to map
between old and new party_id values will result in extremely inefficient performance
because an arbitrarily sized “chain” of party_id values can be constructed via multiple
individualization operations over time and the SQL executed would need to accommodate
such a scenario.
The second major issue to be considered is data distribution. All transactions with the
same "owner" (lowest level of the customer hierarchy) will be hashed to the same bucket
in the Teradata file system. While data distribution does not need to be perfectly
balanced for high performance, it is important to avoid huge disparities in data
distribution. For V2R4 and V2R4.1 the rule-of-thumb is to avoid a situation in which
tens of thousands of transactions map into a single primary index value. While this seems
like an abnormally large number of transactions for a single customer, it does happen.
For example, at a large catalog retailer implementation of Teradata CRM there is a single
customer who orders goods for a whole Indian reservation. The account associated with
these transactions numbers well into the tens of thousands. A large bank in the northeast
has a couple of institutional accounts with transactions numbering into the many tens
of thousands. The CDRs for a large business account of a telecommunications company
can easily number into the tens of thousands. In V2R5 (availability in 2002), this
consideration will be largely eliminated for implementations of the transaction
table where primary index partitioning is performed using the transaction date, as sketched below.
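
A hedged sketch of such a partitioned primary index on transaction date (a V2R5 feature,
see Appendix F); the table, columns, date range, and partitioning interval are illustrative
only.

/* Hedged sketch: illustrative transaction table with a NUPI on Acct_Id
   and a partitioned primary index on the transaction date. */
create multiset table DP_TAB.Transaction_Demo
( Tran_Id  decimal(18,0) not null
, Acct_Id  integer       not null
, Tran_Dt  date          not null
, Tran_Amt decimal(18,2)
)
primary index (Acct_Id)
partition by range_n (Tran_Dt between date '2001-01-01'
                              and     date '2007-12-31'
                              each interval '1' month);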

D.3 Composite Primary Index on Date, Product, and Location

If a client has an existing data warehouse implementation that pre-dates its Teradata CRM
deployment, then it is likely that the transaction table already exists with a primary index
other than the lowest level of the customer hierarchy. One of the most common scenarios
(especially in retail) is to construct a composite primary index on date, product, and
location. A composite primary index with this construction is quite amenable to star join
types of queries whereby a cartesian product (star) join is executed using the qualifying
rows from the calendar, product, and location dimensions. The (relatively small) spool
file resulting from the cartesian product (star) join is duplicated to all vAMPS and a row-
hash merge join is executed extremely efficiently with the large transaction table.

The scenario described above is optimized for product-oriented analyses (such as would
be found in a category management application). Unfortunately, however, this choice of
primary index is not quite so optimized for the customer-oriented analyses as would
typically be found in the use of Teradata CRM. When joining between the lowest level of
the customer hierarchy (e.g., account) and the transaction table, there will not be co-
location between the two tables. This means that one of two forms of data re-distribution
will take place:

 All vAMP duplication of all qualifying accounts


 Hash re-distribution of all qualifying transactions by account_id
The form of data re-distribution to be used will be determined by Teradata's cost-based
optimizer to yield the most efficient join plan. One of the more important factors that will
influence the optimizer's choice is the relative cardinality of qualifying accounts versus
qualifying transactions. For example, a query that analyzes the purchases of all accounts
at a specific store for a particular SKU over a one-week period will be much more likely
to re-distribute the small number of qualifying transactions than a query that analyzes
total purchase behavior over the past five years of transactions for accounts owned by
persons over 65 years of age (assumed to be a small number of qualifying accounts).
In the case where all qualifying accounts are duplicated to all vAMPs, the subsequent step
in the join plan will generally be either of the following:

 A hash join via scanning of the transaction table with probing into a hash
table constructed by scanning the qualifying accounts on each vAMP (assumes
hash join is enabled)
 A local spooling of all qualifying transactions with a sort on the row hash of
account_id followed by a row-hash merge join with the qualifying account rows
duplicated to each vAMP
Note that the presence of a NUSI (non-unique secondary index) on account_id in the
transaction table will improve performance only in cases where the percent of qualifying
accounts is very small (generally much less than one percent) or when the join can be
satisfied completely from the NUSI (as a covered index for the transaction table).
Alternatively, if the qualifying rows from transaction table are hash re-distributed
according to account_id, then a row-hash merge join will subsequently be initiated.
It is important to consider that any time a re-distribution of data is performed, spool space
requirements for the system will be increased. For an all vAMP duplication of qualifying
accounts, the total spool space requirement will be equal to the size of the projection
(usually no more than a few tens of bytes) times the number of qualifying accounts
times the number of vAMPs in the configuration. Thus, for a 20 byte projection from the
account table, 10 million qualifying accounts, and a 64 vAMP configuration, the total
spool space requirement would be approximately 12.8GB per concurrent query. Plus, the
qualifying transactions would be spooled locally to facilitate sorting and a row-hash
merge join unless a hash join is used. If the transactions are hash re-distributed, then the
additional spool space required is the size of the projection from the transaction table
times the number of qualifying transactions. For a 50 byte projection from the transaction
table and 100 million qualifying transactions, the total spool space requirement would be
approximately 5GB per concurrent query.
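As a quick sanity check on these figures, the arithmetic can be run directly in SQL (the byte
counts and row counts below simply restate the example above, 1 GB is taken as 10**9 bytes,
and the exact numbers at any site will differ):

SELECT CAST(20 AS FLOAT) * 10000000 * 64 / 1E9 AS dup_spool_gb      /* account duplication: ~12.8 GB */
     , CAST(50 AS FLOAT) * 100000000     / 1E9 AS redist_spool_gb;  /* transaction re-distribution: ~5 GB */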
Although performance will vary according to the specifics of each query executed,
benchmark testing with Teradata CRM on large client configurations indicates a
performance penalty between 20% and 30% when the primary index is not selected as the
lowest level in the customer hierarchy for the transaction table. Benchmarking with the
Teradata CRM application using a date, product, location primary index on the
transaction table demonstrated spool space requirements between 5-7GB per 100M
transactions involved in the hash re-distribution on a 320 vAMP configuration with five
billion transactions in total. The specific spool space requirement will depend on the
number (and size) of columns selected from the transaction table in the query and the
total number of transactions re-distributed. If multiple queries require access to the
transaction table, it is likely that a synchronization scan could be initiated to avoid
multiple spool files when sharing is appropriate. Of course, specific numbers at each
client site will vary, but they should be within the general range of those described here.
When product-oriented (e.g., category management) and customer-oriented (e.g.,
Teradata CRM) applications coexist in the same data warehouse environment, tradeoffs
will need to be made. A performance penalty will be experienced by the product-oriented
queries when the primary index of the transaction table is selected as account rather than
as a composite of date, product, and location. The magnitude of this penalty will be
similar (in both execution time and spool space) to that experienced by the Teradata CRM
application when the geography of the tables is not optimized for a specific query. Of course,
the specifics of each particular application will vary. Ultimately, the relative importance
of performance between the two applications will need to be considered. Frequency of
execution and service level agreements should be used as primary inputs when deciding
on the primary index in a mixed workload environment.

D.4 Primary Index on Transaction_id


Access via a transaction_id is highly unlikely for most applications. The main advantage
of transaction_id as a primary index is that even data distribution is guaranteed (assuming
transaction_id is the primary key for the transaction table). In general, use of
transaction_id as a primary index is only recommended when a choice of account_id
yields buckets of transactions in the tens of thousands and when a composite of date,
product, and location does not yield a useful join path for any meaningful workloads.
The performance penalty for using a transaction_id relative to an account_id is generally
between 20% and 30% (plus spool space requirements for data re-distribution).
Transaction_id as a primary index is common mainly in telecommunications
implementations.

D.5 Join Indexing


Join indexing can be used to provide an alternate geography for selected columns from
the transaction table. If the transaction table has its primary index as date, product, and
location, it is possible to create a join index (without an actual join) that takes the transaction
table data and builds an index structure with account_id as its primary (distribution) index.

create table tx
(tx_id decimal(15,0) NOT NULL
,account_id decimal(12,0) NOT NULL
,tx_dt date FORMAT 'YYYY-MM-DD' NOT NULL
,product_id integer NOT NULL
,location_id integer NOT NULL
,tx_amt decimal(8,2) NOT NULL
,tx_type char(2) NOT NULL
,cost_amt decimal(8,2) NOT NULL
,item_qty integer NOT NULL
...
) primary index (tx_dt, product_id, location_id);

create join index account_tx as
select (tx.account_id
       ,tx.tx_dt
       ,tx.location_id)
      ,(tx.product_id
       ,tx.tx_amt)
from tx
order by tx.tx_dt
primary index (tx.account_id);

In this way, a "copy" of the most frequently used columns from the transaction table is
placed into an index with a primary index on account_id. Moreover, the index has been
ordered on the transaction date, so date range elimination can be used for queries where
that is appropriate. Use of the join index as described will not only eliminate the 20-30%
performance penalty related to a non-account primary index on the base table, but will
also yield a significant performance boost above and beyond co-located joins via date
range elimination when a query specifies a date range corresponding to a subset of the
transaction table.
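As an illustration, a customer-oriented aggregation such as the following (the date range is
made up) touches only columns carried in account_tx, so the optimizer can satisfy it from the
join index and use the value ordering on tx_dt for range elimination:

select tx.account_id
      ,sum(tx.tx_amt) as total_amt
from tx
where tx.tx_dt between date '2001-01-01' and date '2001-03-31'
group by tx.account_id;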
In V2R4, join indexes will not be used except as covered indexes when considering
access to the base table. In other words, all columns requested from the transaction table
must be included in the join index for its use to be considered. This limitation will be
removed in V2R4.1 in which plans whereby the join index is joined back to the base table
will be considered. The current maximum of 16 columns in the join index will be
removed in V2R5 (available in 2002).
When pursuing the join index strategy, it is critical to consider the space and maintenance
overhead associated with the approach. A join index involves redundant storage of
whatever columns from the transaction table are specified in its definition. Of course, this
additional storage cost is somewhat offset by the reduction in spool space. Clients using
Teradata V2R4.1 or above should only store those transaction table columns that allow
coverage for the majority of queries rather than re-storing the full transaction table.
Maintenance for data loading is usually a more significant concern. The join index
construct will guarantee complete consistency between the index and base table(s).
However, keeping the index up-to-date comes with an associated cost. Each time a
record is inserted into the transaction table, its join index will need to be updated as well.
The cost of updating the join index in this scenario will be slightly higher than the row
insert itself. This cost must be included in the capacity planning for the data warehouse.
Join indexing is only an option for customers who are using SQL via Tpump or from
staging tables to load the transaction table. Multiload is not currently supported for join
indexes and is not planned for the immediate future.

D.6 Junction Table Implementation


A more primitive method than join indexing that has been used for co-existing product
and customer oriented workloads is the use of “junction” tables. The idea is to construct
a junction table containing account_id, tx_dt, product_id, and location_id from the
transaction table, with account_id as the primary index. The transaction table is left with
its primary index as a composite of transaction date, product, and location. This approach
allows a local row-hash match join between the account table and the cross-reference
table to determine the accounts with qualifying transactions. Since transaction date,
product, and location are stored in the junction table, any accounts that do not possess
transactions that qualify against these commonly used filters can be eliminated prior to
joining against the actual transaction table (with row re-distribution). Moreover, this
technique allows hash re-distribution to take place for the spool containing only qualified
account to transaction joins (presumably smaller than hash-redistributing all qualifying
transactions for co-location with the account table). The underlying assumption is that
the majority of filtering on account selection criteria and transaction selection criteria has
taken place prior to re-distribution of the data rows. In addition, scanning the “skinny”
junction table will be less expensive than scanning the full width transaction table.
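A minimal sketch of such a junction table, reusing the column names and types from the tx
example in section D.5 (the table name is illustrative):

create table tx_account_jct
(account_id decimal(12,0) NOT NULL
,tx_dt date FORMAT 'YYYY-MM-DD' NOT NULL
,product_id integer NOT NULL
,location_id integer NOT NULL
) primary index (account_id);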
The Teradata CRM meta data must be adapted to force joins from the account table
through the cross-reference table to the transaction table. Join predicates against
replicated columns (e.g., date) must also be included in the meta data. The integrity of
the cross-reference table with the account and transaction tables must be maintained
manually, but support for multiload is not an issue in this scenario since the cross-
reference table is a normal table in the physical design.
The reality of the junction table approach is that it provides significant benefit only when
there are a large number of accounts that will be eliminated due to filtering or lack of
transactions that meet query qualifications. Moreover, there is a cost associated with the
extra join that will introduce a performance penalty versus a design without the junction
table. Experience has shown that the junction table approach is not usually worth its
storage, complexity, and maintenance overhead. In general, this approach is not
recommended.

D.7 Table Duplication


The brute force method of supporting coexisting product- and customer-oriented workloads with
maximum query performance is to duplicate the transaction table. One copy of the table
is primary indexed on a composite of date, product, and location (to support product-
oriented applications such as category management). A second copy is primary indexed
on account_id (to support Teradata CRM). All appropriate indexes on the two tables are
duplicated as well. In this way, queries from both applications achieve maximum
performance, and neither application suffers the 30% penalty associated with forced data
re-distribution at run-time.
While this is an expedient method of implementing Teradata CRM at maximum
performance, it is usually not the best use of resources at a client site. The transaction
table is most likely the largest table in the data warehouse. In fact, it is not unusual for
the transaction table to be upwards of 50% of the total data warehouse size. Thus,
duplicating the transaction table usually increases total storage utilization by more than
fifty percent on the client configuration. Moreover, load and indexing time for the
transaction table will be doubled from what it was prior to duplication.
An increase in total storage use by more than 50% to obtain a 20-30% performance
increase is not a particularly good design tradeoff. In fairness, the increase in storage use
is somewhat offset by a reduction in spool space requirements related to data re-
distribution. However, it is rare that spool space requirements for data re-distribution
(even in a heavy concurrent user environment) will even come close to approaching the
storage required to duplicate the largest table in the data warehouse.
In cases where pre-existing applications (or other considerations) demand that the
transaction table be primary indexed on something other than account_id, and the
Teradata CRM application is falling short of its performance service level agreement,
then the recommendation is to add overall system capacity rather than duplicate data.
Due to the scaleable nature of the Teradata system, addition of 30% more capacity (CPU
and storage to accommodate non-local join overhead and spool space requirements) will
eliminate the impact of a primary index on something other than account_id. Moreover,
addition of 30% more capacity without data duplication (also avoiding associated loading
and indexing penalty) will provide additional computing resources that will benefit all
workloads. This approach maximizes the value of the client investment in Teradata and
simplifies the overall implementation. Additional capacity can be added in this way to
meet whatever service level has been agreed upon at the client site.

D.8 Anonymous Customer (Account) Considerations


Special considerations must be addressed when working with transactions that may be
associated with anonymous customers. This scenario is most common in traditional retail
data warehouses when customers pay in cash with no loyalty card or other identifier.
This case can also occur in other industries. For example, foreign currency exchange
transactions at a bank branch are typically anonymous. In such cases, the customer or
account identifier that would normally be associated to a transaction is typically left blank
(NULL) or is filled in with a special "dummy" value to indicate an unknown customer.
Working with anonymous transactions can be problematic because using the customer
(account) identifier as a primary index does not work well when its value is NULL, blank,
or assigned to a single value for a large number of transactions. Joins can also be
problematic when a single dummy value is placed in the account_id column for
anonymous transactions, even if account_id is not used as the primary index. This is
because hot (relatively overworked) vAMP situations will likely occur upon data re-
distribution unless the anonymous transactions are explicitly filtered out using a WHERE
clause predicate.
Four physical design techniques have been developed to address the challenges of
anonymous transactions. These techniques are designed to support both product-oriented
analyses (which need all transactions, including anonymous) as well as customer-oriented
analyses (which generally need only identified transactions). Depending on the
technique, performance optimization may be biased toward either product- or customer-
based analyses. The four techniques are as follows:

 Split the transaction table in two parts by anonymous versus identified
transactions with UNION to bring data together for product-oriented analysis
 Assign the negation of transaction_id into the account_id column for
anonymous transactions
 Avoid primary index on account_id
 Separate out and duplicate identified transactions into a distinct table
The performance characteristics and other implications of each choice are discussed in the
following four subsections.

D.8.1 Split Transaction Table

The idea behind the split transaction table approach is to create two versions of the
transaction table: one for anonymous transactions and one for identified transactions. All
anonymous transactions are placed in a table that does not include a customer (account)
identifier since the transactions are unidentified. The primary index on the anonymous
transaction table is most likely a composite of date, product, and location. The identified
transactions are all placed in a transaction table with account_id as a primary index. A
view with an embedded UNION between the two tables is used to provide a complete
picture of all transactions for product-oriented analyses. Note that no data duplication is
initiated when using this technique.
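A minimal sketch of the two tables brought together in a view, assuming identified and
anonymous transactions are held in tables named tx_identified and tx_anonymous that
otherwise follow the tx layout from section D.5 (all names here are illustrative):

replace view all_tx as
select account_id
      ,tx_dt
      ,product_id
      ,location_id
      ,tx_amt
from tx_identified
union all
select cast(NULL as decimal(12,0))  /* anonymous rows carry no account identifier */
      ,tx_dt
      ,product_id
      ,location_id
      ,tx_amt
from tx_anonymous;

Customer-oriented queries would go directly against tx_identified, while product-oriented
queries would use the all_tx view.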
Customer-oriented analyses will be optimized in performance using the split transaction
table technique because queries will only need to touch the identified transactions (the
ones of interest) and these transactions will have account_id as their primary index.
Performance cannot get any better than this for customer-oriented analyses.
Product-oriented analyses, on the other hand, will need to materialize the view that brings
the two transaction tables together into a single spool for analysis. Moreover, the
geography of the two transaction tables will be different (account_id versus date, product,
and location as the primary index) so the identified transactions will most likely be hash
re-distributed as part of the view materialization. All of this adds up to a fairly heavy
penalty for product-oriented analyses.
With a table containing 67% anonymous transactions, the performance advantage for
customer-oriented analyses is approximately 30% for the split transaction table approach
versus the unified transaction table with account_id as a primary index and constructed
account_id keys for anonymous transactions (as described in the next subsection). With a
table containing 33% anonymous transactions, the corresponding advantage is
approximately 17%.
The performance advantage is between 43% and 71% for customer-oriented analyses using
the split transaction table approach versus a single table with all transactions primary
indexed by date, product, and location. The advantage depends on the percentage of
anonymous transactions (ranging from 33% to 67%); the best results come with a large
percentage of anonymous transactions because the identified transaction table is
correspondingly smaller.
In contrast, the performance penalty is a factor of approximately 3.1 for product-oriented
analyses using the split transaction table approach (with UNION in a view) versus the
unified transaction table with account_id as a primary index and constructed account_id
keys for anonymous transactions (as described in the next subsection). The performance
penalty for product-oriented analyses using the split transaction table approach (with
UNION in a view) versus a single table with all transactions primary indexed by date,
product, and location is a factor of 100+ when the date, product, location combination is
relatively selective (allowing for a cartesian product join of the dimension tables followed
by a row-hash merge join with no local spooling of the transaction table) and a factor of
approximately 2.6 when the date, product, location combination is less selective (e.g.,
optimizer chooses not to use a cartesian product join).

D.8.2 Constructed Customer (Account) Key for Anonymous Transactions

It is unacceptable to have the primary index of the transaction table be an account_id if
there are NULL or blank values in the column. Similarly, use of a single, pre-defined
“dummy” account_id value (e.g., a magic value like “-1” to indicate absence of a known
account_id) to indicate an anonymous transaction is unacceptable because there will be a
huge number of transactions assigned to the single hash bucket associated to the dummy
value. An alternative to these scenarios, which allows primary indexing on account_id to
optimize for customer-oriented analyses, is to construct the “dummy” account_id
uniquely for each anonymous transaction. By constructing the dummy account_id
uniquely for each anonymous transaction, it is guaranteed that the anonymous
transactions will be uniformly distributed across all hash buckets (and vAMPS) for
maximum performance in the parallel database.
The most straightforward method of constructing a unique account_id for anonymous
transactions is to make use of the (guaranteed unique) transaction_id. However, it is
important that we never have the potential to overlap the assignment of a transaction_id
with a valid account_id. If we assume that valid account_id values for identified
transactions are always positive, then we can negate the transaction_id before placing it in
the account_id column and be guaranteed that it is both unique across all anonymous
transactions and that it does not overlap with a valid account_id value for an identified
transaction.
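A minimal sketch of the negation during loading, assuming a staging table stg_tx with the
same columns as tx and a NULL account_id on anonymous rows (the staging table, and the
restriction to the columns shown in the section D.5 definition, are assumptions):

insert into tx (tx_id, account_id, tx_dt, product_id, location_id
               ,tx_amt, tx_type, cost_amt, item_qty)
select s.tx_id
      ,coalesce(s.account_id, -s.tx_id)  /* negated tx_id: unique and never collides with a real account_id */
      ,s.tx_dt
      ,s.product_id
      ,s.location_id
      ,s.tx_amt
      ,s.tx_type
      ,s.cost_amt
      ,s.item_qty
from stg_tx s;

Note that with the column types shown in section D.5, the account_id column would need to
be declared wide enough (e.g., decimal(15,0)) to hold every negated tx_id.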
This technique allows for the existence of a single transaction table with the lowest level
of the customer hierarchy assigned as the primary index. As discussed in the previous
section, this technique will not perform quite as well as the split transaction table
approach for customer-oriented analysis because the table contains anonymous
transactions that are not of interest for customer queries (performance penalty of between
17% and 30% depending on the number of anonymous transactions – ranging from 1/3 to
2/3). However, product-oriented analyses will perform significantly better than in the
split transaction table approach because view materialization at run-time with the UNION
of two large tables is avoided (approximately a factor of 3.1 in performance benefit
versus the split transaction table approach).
This approach is considered to be a reasonable compromise between the split transaction
table and avoiding the use of account_id as a primary index. The approach has a slight
bias toward customer-oriented analyses (as compared to primary indexing on date,
product, and location), but is not as extreme as the split table approach. Moreover, for
market basket analysis, the constructed account_id from transaction_id guarantees that all
line items within a single transaction (market basket) will be hashed to the same vAMP.
In addition, the universe of all transactions will be well distributed across all vAMPS.
This leads to superior performance for market basket analysis because primary indexing
on account_id (including valid account_ids and account_ids constructed from
transaction_id) will guarantee that all items within a single basket are always co-located
for join and aggregate processing. To achieve this benefit, the SQL generated must
include joins on the account_id, in addition to transaction_id, when performing the
analysis. Note that the extra join is redundant (i.e., if two items are in the same
transaction, then clearly they are associated to the same account). However, it is
important to explicitly include this redundancy so that the optimizer is assured that co-
located joins will yield correct answers.
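As an illustration of the redundant join, a simplified product-pairing query against the tx
table might look like the following (this is a hand-written sketch, not the SQL generated by
Teradata CRM, and it assumes one row per line item so that tx_id repeats within a basket):

select a.product_id as product_a
      ,b.product_id as product_b
      ,count(*) as basket_cnt
from tx a
    ,tx b
where a.tx_id = b.tx_id
  and a.account_id = b.account_id  /* redundant, but lets the optimizer treat the join as vAMP-local */
  and a.product_id < b.product_id
group by a.product_id, b.product_id;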
Note that market basket queries, despite looking at products, possess performance
characteristics more similar to customer-oriented analyses than product-oriented analyses.
This is because in a market basket analysis, a primary aspect of the query execution is to
examine product purchases within a single transaction. By definition, all product
purchases within a single transaction will be purchased by a single customer. The
technique described in this section whereby all transactions are in a single table with
primary index on account_id will yield as good, if not better, performance as compared to
a primary index on date, product, and location for market basket analysis. Moreover, the
market basket will perform better with account_id as a primary index on a single
transaction table (regardless of the percent of anonymous transactions) as compared to the
split transaction table approach (because we avoid view materialization with the
embedded UNION operation).

D.8.3 Avoid Primary Index on account_id

One method that removes the primary indexing concerns related to anonymous
transactions is to avoid using account_id as the primary index. By sticking with date,
product, and location (or other key) as the primary index, the data distribution issues
associated with anonymous transactions are made (nearly) irrelevant. As long as we filter
out anonymous transactions in customer-oriented analysis through use of WHERE clause
predicates (to remove NULL, blank, or dummy account_id transactions), then we are
clear of the issues of concern. This approach is equivalent to those described in
sections D.3 or D.4. Of course, the downside of the approach is that customer-oriented
analyses suffer a penalty of approximately 20-30% in performance as well as spool space
overhead associated with data re-distribution. This approach may well be the most
desired one in environments where product analysis is heavily used relative to customer
analysis.
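For example, the filtering can be centralized in a view so that customer-oriented tools never
see the anonymous rows (the view name and the dummy value -1 are purely illustrative):

replace view identified_tx as
select *
from tx
where account_id is not null       /* remove NULL account_ids */
  and account_id <> -1;            /* remove the illustrative dummy value */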

D.8.4 Duplication of Identified Transactions


One way to deliver both product-oriented and customer-oriented analyses at maximum
performance is to duplicate identified transactions into a distinct table. Using this
approach, a table containing all transactions (anonymous and identified) would be
constructed with a primary index optimized for product-oriented analyses (e.g., primary
index on date, product, and location). In addition, (only) those transactions that can be
associated with an identifiable customer will be duplicated into a separate table that has
account_id as its primary index. This implementation delivers performance for customer-
oriented analyses equivalent to the split transaction table approach (described in section
D.8.1) and performance for product-oriented analyses equivalent to avoiding account_id
as the primary index (described in section D.8.3). This is the best of both performance
worlds, at first glance.
In reality, this approach is basically the same as the table duplication approach described in
section D.7. Storage costs for the identified transactions will double, as will data loading
and indexing for these transactions. For customers with a small percentage of identified
transactions, this approach may seem very appealing because the amount of data
duplication is small relative to the performance advantages. However, over time it is
likely in almost any industry that the trend will be toward more and more transactions that
are identified (clearly seen in the trends toward loyalty cards, smart cards, m-commerce,
and so on). Thus, while this approach may seem attractive in the short-term for clients
who have many more anonymous transactions than identified ones, the
medium- to long-term result will be undesirable data duplication at large scale.

Appendix E. Summary of Primary
Indexing Recommendations

For client sites where Teradata CRM is the dominant data warehouse application, primary
indexing on the lowest level of the customer hierarchy (e.g., account_id) is generally
recommended. Exceptions to this recommendation are when there exist individual
accounts associated to tens of thousands of transactions (uneven distribution of
transactions to accounts) or when the lowest level of the customer hierarchy is highly
volatile (e.g., party_id).
For client sites where product-oriented applications (or other considerations) prevent the
transaction table from being primary indexed on account_id, use of a (non-unique)
composite primary index containing the most frequently (in combination) accessed
dimensional keys (e.g., tx_dt, product_id, location_id) is recommended to facilitate an
efficient star join. This will result in a 20-30% performance penalty and additional spool
space for customer-oriented queries (versus account_id as a primary index).
If there are performance delivery issues at the client site when using customer and
transaction tables that are not co-located, value-ordered join indexing is generally
recommended to change the geography of the most important columns from the
transaction table to allow for co-location and date range elimination with maximum
performance results. This approach will not work if the client is dependent on multiload
for loading the transaction table. Clients using join indexes should be encouraged to
upgrade to V2R4.1 as soon as possible to benefit from the significant join indexing
enhancements in this release.
Explicit data duplication via replication of the transaction table is an approach of last
resort. Since this approach involves duplication of the largest table in the data warehouse
and only yields a 20-30% benefit in performance versus a scenario where the account and
transaction tables are not co-located, it is generally preferable to add general purpose
capacity to meet whatever performance service level is required rather than use special
purpose table duplication.
For data warehouse implementations with anonymous transactions (e.g., retailers that
cannot identify customers for all transactions), the highest performance option for
customer-oriented queries will be the split transaction table approach using account_id as
the primary index for identified transactions and date, product, and location for the
primary index on anonymous transactions. However, product-oriented queries will be
penalized significantly with the split transaction table approach (a factor of 2.6 to 100+).
Product queries using the single transaction table with a primary index on account_id will
be roughly equivalent in performance to the primary index on date, product, and location
when a large cardinality of dates, products, and locations are specified. If a small
cardinality of dates, products, and locations is specified, then a primary index on date,
product, and location will perform approximately 36 times faster than when the
transaction table is primary indexed on account_id.

E.1 Primary Index Performance Tradeoffs when all Transactions are Identified to an Account_id

The table below shows the difference in performance for the two most likely primary
index choices for the transaction table in an environment where all (or nearly all)
transactions are identifiable to a customer (or account). The chart is meant to describe the
relative performance difference for “average” queries of four different types:

 Customer analysis typically performed using the Teradata CRM application


 Market basket analysis as would be performed using the Retail Decisions
application suite using MicroStrategy
 Selective product analysis as might be performed using a query tool such as
Business Objects with small enough cardinality in the selected date, product, and
location combination that a cartesian product (star) join is selected by the cost-
based optimizer
 Product analysis without sufficient selectivity on the dates (e.g., many
months), products (e.g., a whole category), and locations (e.g., a whole region) to
allow for performance-effective use of a cartesian product (star) join
In the customer and market basket analysis queries, primary indexing on the account_id
yields a 21% and 22% performance benefit (respectively) versus the primary index on
date, product, and location. On the other hand, the date, product, and location primary
index wins out over the account_id primary index by a factor of 43 for selective product
analysis. When the product analysis is not selective, then performance is even between
the two primary indexing choices. For more detail, please see the write-up in sections D.2
through D.7 of this document.

Primary Index Design       Customer Analysis   Market Basket Analysis   Selective Product Analysis   Unselective Product Analysis
Account_id                 100                 100                      4,300                        100
Date, Product, Location    121                 122                      100                          100

Note: Values in the table above are relative query cost (elapsed time) normalized to a
baseline of 100, so lower is better. For example, a rating of 4,300 indicates a query that
runs approximately 43 times longer than one rated 100.

E.2 Primary Index Performance Tradeoffs when some
Transactions are Anonymous

The table below shows the difference in performance for the three most likely primary
index choices for the transaction table in an environment where there is a mix between
identified and anonymous transactions. The chart is meant to describe the relative
performance difference for “average” queries in five distinct situations:

 Customer analysis typically performed using the Teradata CRM application


with 1/3 of the transactions as anonymous
 Customer analysis typically performed using the Teradata CRM application
with 2/3 of the transactions as anonymous
 Market basket analysis as would be performed using the Retail Decisions
application suite using MicroStrategy
 Selective product analysis as might be performed using a query tool such as
Business Objects with small enough cardinality in the selected date, product, and
location combination that a cartesian product (star) join is selected by the cost-
based optimizer
 Product analysis without sufficient selectivity on the dates (e.g., many
months), products (e.g., a whole category), and locations (e.g., a whole region) to
allow for performance-effective use of a cartesian product (star) join
In the customer analysis queries, the split transaction table approach performed between
17% and 30% better than the account_id with negation (because we are scanning
transactions we don’t care about) and between 71% and 90% better than the date, product,
location primary index (because of re-distribution and scanning of anonymous
transactions). Market basket analysis performed best with account_id (using negation on
the transaction_id for anonymous transactions) by 14% to 85% because of the re-
distribution of data when using the date, product, location primary index and the UNION
overhead in the split table approach. Selective product-oriented queries performed much
better with a date, product, and location primary index: by a factor of 111 versus split
tables and a factor of 36 versus the account_id (with negation) primary index. For more
detail, please see the write-up in section D.8 of this document.

Primary Index Design       1/3 Anonymous       2/3 Anonymous       Market Basket   Selective Product   Unselective
                           Customer Analysis   Customer Analysis   Analysis        Analysis            Product Analysis
Split Tables               100                 100                 185             11,100              260
Account_id with negation   117                 130                 100             3,600               100
Date, Product, Location    171                 190                 114             100                 100
Note: Values in the table above are relative query cost (elapsed time) normalized to a
baseline of 100, so lower is better. For example, a rating of 3,600 indicates a query that
runs approximately 36 times longer than one rated 100.

E.3 Many-to-many relationships in CRM 4.0
Duncan Ross
2nd July 2001
A common feature in client databases is the existence of many-to-many relationships
between tables. This is particularly common in the Financial Services Industry, where
customers can have many products (accounts), and each account may have many
customers.
Although direct many-to-many relationships occur in logical database models these are
usually expressed through intermediate tables in physical database models.
Wherever such a relationship between tables occurs, users will have difficulties
interpreting and accessing their data.
Tools such as Teradata CRM 4.0 will also have this problem, and so a mechanism for
dealing with it needs to be found.

CUSTOMER           CUST_POL_REL      POLICY
CUSTID (PK)        CUSTID (PK)       POLID (PK)
DATE_OF_BIRTH      POLID (PK)        PUR_DATE
ADDRESS_1          REL_TYPE (PK)     PAYMENT_AMT
ADDRESS_2                            PAYMENT_TYPE
ADDRESS_3                            PAYMENT_FREQ
POSTCODE                             EST_MAT_VAL
DATE_FIRST_POL                       TPI
…                                    …

Example of Many-to-Many Relationship

E.3.1 An example of a many-to-many relationship
The example on the prior page is taken from a life insurance company, but could be found
in many financial services organisations.
Customer is related to Policy in a many-to-many relationship, expressed through an
intermediate table CUST_POL_REL. To get meaningful information about our customer
we need to know which of their relationships with a policy we are using. Are they the
primary holder? Are they the life insured? Are they a secondary or tertiary life insured?
In CRM this can become even more complex as we may want to know answers to
questions such as:
Who are our policy holders who own a policy that is within six months of maturity where
the policy holder is not a life insured on that policy?

E.3.2 Solutions available to version 4.0


The most appropriate approach is to map hierarchies to a series of views that denormalise
the relationships. This can be difficult if there are many possible relationships, but in
most cases there will be a limited number of frequently used relationships, allowing a
restricted number of views to be built and mapped.
For example a view for the principal owner policy can be created:

CREATE VIEW pol_owner_1 AS
(SELECT C.CUSTID,
        P.POLID,
        P.PUR_DATE,
        P.PAYMENT_AMT,
        P.PAYMENT_TYPE,
        P.PAYMENT_FREQ,
        P.EST_MAT_VAL,
        P.TPI
 FROM   CUSTOMER C,
        CUST_POL_REL R,
        POLICY P
 WHERE  C.CUSTID = R.CUSTID AND
        R.REL_TYPE = 1 AND
        R.POLID = P.POLID);

This view, and the other denormalised views, can now be accessed when necessary, and
have a simple one-to-many relationship to the customer table.
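For example, a hypothetical query counting principal-owner policies per postcode can now
treat the relationship as a simple one-to-many join:

SELECT C.POSTCODE,
       COUNT(O.POLID) AS POLICY_CNT
FROM   CUSTOMER C,
       pol_owner_1 O
WHERE  C.CUSTID = O.CUSTID
GROUP BY C.POSTCODE;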

[Figure: the CUSTOMER table related one-to-many to the denormalised views pol_owner_1
and pol_owner_2, each containing CUSTID, POLID, PUR_DATE, PAYMENT_AMT,
PAYMENT_TYPE, PAYMENT_FREQ, EST_MAT_VAL, and TPI.]

E.4 Multiple Column Keys and CRM 4.0


Duncan Ross
2nd July 2001

In an ideal world every table in a relational database can be uniquely identified by a
single column - the Primary Key.
Unfortunately, many real world data models include keys that are made up of several
fields. This could occur because of legacy systems or database design issues.
Applications accessing this data therefore need to be able to uniquely identify records
using this multiple column key.
Multiple column keys are also referred to as Multi Field Indexes or Composite Keys.

E.4.1 An example of a MCK


The following customer table is uniquely defined by two fields, one indicating a customer
number, the other the originating legacy system. This is needed because it is possible to
give the same customer number to two different people through the different operational
systems.

CUSTOMER
CUST_NO (PK)
ORIG_SYS (PK)
DATE_OF_BIRTH
NUM_POLICY
ORIG_BRANCH
NAME
FAM_NAME
ADDRESS1
ADDRESS2
ADDRESS3
POSTCODE

E.4.2 What do we do in the current release?


Teradata CRM 4.0 will not support multiple column keys for an entity within the
hierarchy. This means that for each level in the viewed hierarchy (e.g. Customer-
Account-Card) it expects one column to uniquely identify the row: one attribute to
identify Customer, one attribute to identify Account, and one for Card.
The FS-LDM contains at least one example of a multiple column key.

E.4.3 Solutions Available to Version 4.0

Generated Keys
The first option is to add a generated key to the table. This may be a compound of the
existing keys, or may be a totally independent value (such as an incrementing number).
In this case the original keys remain as attributes in the table.

CUSTOMER
UNIQUE_KEY (PK)
CUST_NO
ORIG_SYS
DATE_OF_BIRTH
NUM_POLICY
ORIG_BRANCH
NAME
FAM_NAME
ADDRESS1
ADDRESS2
ADDRESS3
POSTCODE

Concatenated Keys
A similar approach to generated keys, but formed by concatenating the existing keys.
This has the advantage that the original keys can be determined from the new key, but
may run into problems due to key size.

CUSTOMER
CUST_NO_ORIG_SYS
(PK)
DATE_OF_BIRTH
NUM_POLICY
ORIG_BRANCH
NAME
FAM_NAME
ADDRESS1
ADDRESS2
ADDRESS3
POSTCODE

Views
Where a customer has already populated their database and is unwilling to change the
design it may be possible to implement either of the above methods using views.
However, this would affect performance.
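A hedged sketch of the view-based variant, assuming CUST_NO and ORIG_SYS are numeric
columns (the view name, the generated column name, and the data types are assumptions):

CREATE VIEW CUSTOMER_KEYED AS
(SELECT TRIM(CAST(C.ORIG_SYS AS VARCHAR(4))) || '-' ||
        TRIM(CAST(C.CUST_NO AS VARCHAR(12))) AS CUST_NO_ORIG_SYS,
        C.CUST_NO,
        C.ORIG_SYS,
        C.DATE_OF_BIRTH,
        C.POSTCODE
 FROM   CUSTOMER C);

This gives a single surrogate identifier without altering the stored table, at the cost of
evaluating the expression at query time.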

Is it a real problem?
One final approach is to determine if it is possible to use just one of the fields used in the
key to uniquely identify a record. Databases may be designed to cope with cases that
have not yet occurred.
In the example above it is known that two source systems could produce the same
customer number. The first system allocates numbers sequentially from 1, the second
sequentially from 1 000 000. Until there are at least 1 million customers this won’t be a
problem, and we can identify the customer table by CUST_NO alone.

This delays the requirement rather than removing it, and needs careful discussion with the
client to evaluate the issue.

Appendix F. V2R5 Nuggets
A number of new features have been introduced in V2R5. The field experience with these
features is rather limited, so this document gives just a few hints on what these features are
and how they can be useful in future implementations. This is not a full description of all
the V2R5 features, but a highlight of the features that are useful in connection with physical
modeling.

F.1 Materialized view


The question "What is a materialized view?" was asked on the atr mailing list. The
following is the answer from Carrie Ballinger:
"Materialized Views" is a descriptive term, a marketing term, for the various
improvements and extensions we have made to the join index in V2R5. We are not
replacing join indexes, but rather building on them.
Some of these improvements are going to open new areas of functionality for Teradata
users, but nevertheless, you will still be coding "CREATE JOIN INDEX AS..."
Two of the more interesting join indexes (or "materialized views" if you prefer) you can
code in V2R5 are the so-called "Sparse Index" and "Global Index". Both "sparse" and
"global" are descriptive labels as well, you will not see them in the actual syntax you use.
The sparse index is a single table join index in which the CREATE JOIN INDEX
statement contains a WHERE clause that allows you to limit which rows of the base table
will participate. You make the primary index of the join index the column being indexed,
and then the only other thing you need to specify is rowID of the base table. For
example, you can exclude nulls, or only index rows within a given time sequence. You
could eliminate from your sparse index high frequency values that would result in uneven
distribution and possibly uneven work, or cause the optimizer not to use the index for
those values. This only works with a join index, not a NUSI.
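A hedged sketch of a sparse, single-table join index against the tx table from Appendix D
(the name and the WHERE clause are illustrative):

CREATE JOIN INDEX tx_2001_ji AS
SELECT account_id
      ,tx_dt
      ,tx_amt
FROM tx
WHERE tx_dt BETWEEN DATE '2001-01-01' AND DATE '2001-12-31'
PRIMARY INDEX (account_id);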
The second interesting enhancement is being called "global index". It too is a join index,
but provides the opportunity of performing a few-AMP, rather than an all-AMP,
operation, offering tactical performance benefits that a NUSI (which is always all-AMP)
cannot.
table, but the index structure is not "local to one AMP" as a NUSI is (thus the name
"global"). It can be thought of as a "hashed-NUSI", or it can be thought of as a "non-
unique USI". The value being indexed is defined as the primary index of the join index,
and since the join index row is hashed by PI value just as though it were a regular
Teradata table, essentially the index structure is hashed, yet non-unique. You need to
include rowID as a repeating group in the join index. If you have a tactical query that has
an equality condition on a fairly unique (but not completely unique) column, access using
a global index would involve only a few AMPs, just the AMPs pointed to by the rowIDs,
instead of all AMPs. Global index takes advantage of the GROUP AMP capability of
V2R5, wherever possible.
And you can combine a sparse and a global index into one join index if you wish.
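A hedged sketch of a global index on the same table (ROWID may only be referenced inside
a join index definition; the indexed column is illustrative, and a WHERE clause could be
added to make the index sparse as well):

CREATE JOIN INDEX tx_loc_gi AS
SELECT location_id
      ,ROWID
FROM tx
PRIMARY INDEX (location_id);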
These new features open up tremendous opportunities, particularly for tactical queries,
but for the complex query workloads as well. And although marketing labels can be
confusing to the implementer, it's a very good thing, and a long overdue thing, that we are
promoting specific functionality within our new releases with this level of enthusiasm.

F.2 Partitioned Primary Index


Partitioned Primary Index (PPI) is a new table organization that optimizes the physical
database design for narrow range-constraint queries. Data within a table can be partitioned
so that rows with the same value of a partitioning expression are physically grouped
together (a minimal DDL sketch is given at the end of this section).
Benefits:

 Increases the available options to improve the performance of range queries


 Only rows of the qualified partitions in a query need to be accessed
Easy to manage:

 Simple specification in CREATE TABLE


 Single operation to alter partitioning:
 copy data in dropped partitions to another table
 delete rows of dropped partitions
 drop existing partitions
 add new partitions

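A minimal DDL sketch, reusing columns from the Appendix D tx example (the table name,
date range, and monthly interval are illustrative):

CREATE TABLE tx_ppi
(tx_id decimal(15,0) NOT NULL
,account_id decimal(12,0) NOT NULL
,tx_dt date FORMAT 'YYYY-MM-DD' NOT NULL
,tx_amt decimal(8,2) NOT NULL
) PRIMARY INDEX (account_id)
PARTITION BY RANGE_N(tx_dt BETWEEN DATE '2001-01-01'
                            AND DATE '2003-12-31'
                            EACH INTERVAL '1' MONTH);

A range constraint on tx_dt then only has to touch the qualifying monthly partitions.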
F.3 Identity Columns
'Identity Column' is a new column attribute that allows generation of a unique number for
each row as rows are added to a table. Identity columns are useful when there is a need to
automatically generate unique values for a column, and they eliminate the need to generate
unique ids in an application outside the database.
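A minimal sketch of a table using an identity column as a generated key (all names are
illustrative):

CREATE TABLE customer_key_gen
(cust_key INTEGER GENERATED ALWAYS AS IDENTITY
          (START WITH 1 INCREMENT BY 1 NO CYCLE)
,cust_no INTEGER NOT NULL
,orig_sys INTEGER NOT NULL
) UNIQUE PRIMARY INDEX (cust_key);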
When an identity column table is being bulk-loaded for the first time, there could be an
initial performance hit as every VPROC that has rows reserves a range of numbers from
DBC.IdCol and sets up its local cache entry. Thereafter, as data skew spreads out the
number reservations, the contention should diminish.
There is a slight overhead in generating the numbers; a rough estimate is a few seconds for
every couple of thousand rows inserted.
Note that the identity column is not yet supported by the Teradata load tools (TPump,
TeraBuilder, MultiLoad, FastLoad). It is, however, planned to be supported from release
V2R5.1.

F.4 Value list compression
This feature allows multiple values to be compressed on a column. Up to 255 distinct
values (plus NULL) may be compressed per fixed width column.
Benefits
Performance improvement, because there is less physical data to retrieve during scan-
oriented queries, especially useful during:

 general ad-hoc workloads


 full-table scan applications
Furthermore, it reduces storage cost by storing more logical data per unit of physical
capacity.
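A minimal sketch of value list compression on fixed-width columns (the table, columns, and
compressed values are illustrative):

CREATE TABLE location_dim
(location_id integer NOT NULL
,country_cd char(2) NOT NULL COMPRESS ('DK', 'SE', 'NO', 'FI')
,store_type char(8) COMPRESS ('OUTLET', 'FLAGSHIP')
) PRIMARY INDEX (location_id);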

Appendix G. More on
denormalization
The following articles from http://www.tdan.com/edatt1_tocf.htm by Steve Hoberman
and Craig S. Mullins give some food for thought.

We have to be very careful and selective where we introduce denormalization because it
can come with a huge price even though it can decrease retrieval time. The price for
denormalizing can take the form of these bleak situations:

 Update, delete, and insert performance can suffer. When we repeat a data
element in two or more tables, we can usually retrieve the values within this data
element much more quickly. However, if we have to change the value in this data
element, we need to change it in every table where it resides. If Bob Jones
appears in five different tables, and Bob would prefer to be called “Robert”, we
will need to change “Bob Jones” to “Robert Jones” in all five tables, which takes
longer than making this change to just one table.
 Sometimes even read performance can suffer. We denormalize to increase
read or retrieval performance. Yet if too many data elements are denormalized
into a single entity, each record length can get very large and there is the potential
that a single record might span a database block size, which is the length of
contiguous memory defined within the database. If a record is longer than a block
size, it could mean that retrieval time will take much longer because now some of
the information the user requests will be in one block, and the rest of the
information could be in a different part of the disk, taking significantly more time
to retrieve. A Shipment entity I’ve encountered recently suffered from this
problem.
 You may end up with too much redundant data. Let's say the
CUSTOMER LAST NAME data element takes up 30 characters. Repeating this
data element three times means we are now using 90 instead of 30 characters. In a
table with a small number of records, or with duplicate data elements with a fairly
short length, this extra storage space will not be substantial. However, in tables
with millions of rows, every character could require megabytes of additional
space.
 It may mask lack of understanding. The performance and storage
implications of denormalizing are very database- and technology-specific. Not
fully understanding the data elements within a design, however, is more of a
functional and business concern, with potentially much worse consequences. We
should never denormalize without first normalizing. When we normalize, we
increase our understanding of the relationships between the data elements. We
need this understanding in order to know where to denormalize. If we just go
straight to a denormalized design, we could make very poor design decisions that
could require complete system rewrites soon after going into production. I once
reviewed the design for an online phone directory, where all of the data elements
for the entire design were denormalized into a single table. On the surface, the
table looked like it was properly analyzed and contained a fairly accurate primary
key. However, I started grilling the designer with specific questions about his
online phone directory design:
“What if an employee has two home phone numbers?”
“How can we store more than one email address for the same employee?”
“Can two employees share the same work phone number?”
After receiving a blank stare from the designer, I realized that
denormalization was applied before fully normalizing, and therefore, there
was a significant lack of understanding of the relationships between the data
elements.

 It might introduce data quality problems. By having the same data element
multiple times in our design, we substantially increase opportunities for data
quality issues. If we update Bob's first name from Bob to Robert in 4 out of 5 of
the places his name occurred, we have potentially created a data quality issue.
Being aware of these potential dangers of denormalization encourages us to make
denormalization decisions very selectively. We need to have a full understanding of the
pros and cons of each opportunity we have to denormalize. This is where the
Denormalization Survival Guide becomes a very important tool. The Denormalization
Survival Guide will help us make the right denormalization decisions, so that our designs
can survive the test of time and minimize the chances of these bleak situations from
occurring.

DENORMALIZATION GUIDELINES

Normalization is the process of putting one fact in one appropriate place. This optimizes
updates at the expense of retrievals. When a fact is stored in only one place, retrieving
many different but related facts usually requires going to many different places. This
tends to slow the retrieval process. Updating is quicker, however, because the fact you're
updating exists in only one place.
It is generally recognized that all relational database designs should be based on a
normalized logical data model. With a normalized data model, one fact is stored in one
place, related facts about a single entity are stored together, and every column of each
entity refers non-transitively to only the unique identifier for that entity. Although an in-
depth discussion of normalization is beyond the scope of this article, brief definitions of
the first three normal forms follow:

 In first normal form, all entities must have a unique identifier, or key, that can
be composed of one or more attributes. In addition, all attributes must be atomic
and non-repeating. (Atomic means that the attribute must not be composed of
multiple attributes. For example, EMPNO should not be composed of social
security number and last name because these are separate attributes.)
 In second normal form, all attributes that are not part of the key must depend
on the entire key for that entity.
 In third normal form, all attributes that are not part of the key must not
depend on any other non-key attributes.
Frequently, however, performance needs dictate very quick retrieval capability for data
stored in relational databases. To accomplish this, sometimes the decision is made to
denormalize the physical implementation. Denormalization is the process of putting one
fact in numerous places. This speeds data retrieval at the expense of data modification.
It is not the intention of this article to promote the concept of denormalization. Of course,
a normalized set of relational tables is the optimal environment and should be
implemented whenever possible. Yet, in the real world, denormalization is sometimes
necessary. Denormalization is not necessarily a bad decision if implemented wisely. You
should always consider these issues before denormalizing:

 can the system achieve acceptable performance without denormalizing?


 will the performance of the system after denormalizing still be unacceptable?
 will the system be less reliable due to denormalization?
If the answer to any of these questions is "yes," then you should avoid denormalization
because any benefit that is accrued will not exceed the cost. If, after considering these
issues, you decide to denormalize be sure to adhere to the general guidelines that follow.
If enough DASD is available at your shop, create two sets of tables: one set fully
normalized and another denormalized. Populate the denormalized versions by querying
the data in the normalized tables and loading or inserting it into the denormalized tables.
Your application can access the denormalized tables in a read-only fashion and achieve
performance gains. It is imperative that a controlled and scheduled population function is
maintained to keep the data in the denormalized and normalized tables synchronized.
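
A minimal sketch of such a controlled refresh, assuming hypothetical normalized tables EMP and DEPT feeding a read-only, denormalized copy EMP_DEPT_DN:

    -- Scheduled job: rebuild the denormalized copy from the normalized source
    DELETE FROM EMP_DEPT_DN;

    INSERT INTO EMP_DEPT_DN (EMPNO, LASTNAME, DEPTNO, DEPTNAME)
    SELECT e.EMPNO, e.LASTNAME, e.DEPTNO, d.DEPTNAME
    FROM   EMP  e
    JOIN   DEPT d ON d.DEPTNO = e.DEPTNO;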
If DASD is not available for two sets of tables, then maintain the denormalized tables
programmatically. Be sure to update each denormalized table representing the same entity
at the same time or, alternatively, to provide a rigorous schedule whereby the tables will be
synchronized. At any rate, all users should be informed of the implications of inconsistent
data if it is deemed impossible to avoid unsynchronized data.
When updating any column that is replicated in many different tables, always update it
everywhere that it exists simultaneously, or as close to simultaneously as possible given
the physical constraints of your environment. If the denormalized tables are ever out of
sync with the normalized tables, be sure to inform end-users that batch reports and on-line
queries may not contain sound data; if at all possible, this should be avoided.
Finally, be sure to design the application so that it can be easily converted from using
denormalized tables to using normalized tables.

The Reason for Denormalization


Only one valid reason exists for denormalizing a relational design - to enhance
performance. However, there are several indicators which will help to identify systems
and tables which are potential denormalization candidates. These are:

 Many critical queries and reports exist which rely upon data from more than
one table. Oftentimes these requests need to be processed in an on-line
environment.
 Repeating groups exist which need to be processed in a group instead of
individually.
 Many calculations need to be applied to one or many columns before queries
can be successfully answered.
 Tables need to be accessed in different ways by different users during the
same timeframe.
 Many large primary keys exist which are clumsy to query and consume a
large amount of DASD when carried as foreign key columns in related tables.
 Certain columns are queried a large percentage of the time. Consider 60% or
greater to be a cautionary number flagging denormalization as an option.
Be aware that each new RDBMS release usually brings enhanced performance and
improved access options that may reduce the need for denormalization. However, most of
the popular RDBMS products on occasion will require denormalized data structures.
There are many different types of denormalized tables which can resolve the performance
problems caused when accessing fully normalized data. The following topics will detail
the different types and give advice on when to implement each of the denormalization
types.

Pre-Joined Tables
If two or more tables need to be joined on a regular basis by an application, but the cost of
the join is prohibitive, consider creating tables of pre-joined data. The pre-joined tables
should:

 contain no redundant columns (matching join criteria columns)


 contain only those columns absolutely necessary for the application to meet
its processing needs
 be created periodically using SQL to join the normalized tables
The cost of the join will be incurred only once when the pre-joined tables are created. A
pre-joined table can be queried very efficiently because every new query does not incur
the overhead of the table join process.
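
As a hedged sketch (hypothetical ORDER_HDR and ORDER_LINE tables, Teradata-style CREATE TABLE ... AS syntax), a pre-joined table might be built as follows:

    -- Rebuilt periodically; carries only the columns the application needs,
    -- and carries the join column ORDER_ID once rather than redundantly
    CREATE TABLE ORDER_PREJOIN AS
    ( SELECT h.ORDER_ID
           , h.CUSTOMER_ID
           , h.ORDER_DATE
           , l.LINE_NBR
           , l.PRODUCT_ID
           , l.LINE_AMT
      FROM  ORDER_HDR  h
      JOIN  ORDER_LINE l ON l.ORDER_ID = h.ORDER_ID
    ) WITH DATA;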

Report Tables
Oftentimes it is impossible to develop an end-user report using SQL or QMF alone.
These types of reports require special formatting or data manipulation. If certain critical
or highly visible reports of this nature are required to be viewed in an on-line
environment, consider creating a table that represents the report. This table can then be
queried using SQL, QMF, and/or another report facility. The report should be created
using the appropriate mechanism (application program, 4GL, SQL, etc.) in a batch
environment. It can then be loaded into the report table in sequence. The report table should:

 contain one column for every column of the report


 have a clustering index on the columns that provide the reporting sequence
 not subvert relational tenets (such as 1NF and atomic data elements)

Report tables are ideal for carrying the results of outer joins or other complex SQL
statements. If an outer join is executed and then loaded into a table, a simple SELECT
statement can be used to retrieve the results of the outer join, instead of the complex
UNION technique shown in Figure 1. Some RDBMS products support an explicit outer
join function which can be used instead of the UNION depicted. However, depending on
the implementation, the explicit outer join may be simpler or more complex than the
UNION it replaces.

Figure 1. Outer Join Technique Using UNION
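
As a hedged sketch of this UNION technique (hypothetical EMP and DEPT tables), a left outer join of employees to departments can be simulated as follows:

    -- Matching rows ...
    SELECT e.EMPNO, e.LASTNAME, d.DEPTNAME
    FROM   EMP e, DEPT d
    WHERE  e.DEPTNO = d.DEPTNO
    UNION ALL
    -- ... plus employees with no matching department row
    SELECT e.EMPNO, e.LASTNAME, CAST(NULL AS VARCHAR(30))
    FROM   EMP e
    WHERE  NOT EXISTS (SELECT 1 FROM DEPT d WHERE d.DEPTNO = e.DEPTNO);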

Mirror Tables
If an application system is very active it may be necessary to split processing into two (or
more) distinct components. This requires the creation of duplicate, or mirror tables.
Consider an application system that has very heavy on-line traffic during the morning and
early afternoon hours. This traffic consists of both querying and updating of data.
Decision support processing is also performed on the same application tables during the
afternoon. The production work in the afternoon always seems to disrupt the decision
support processing, causing frequent timeouts and deadlocks.
This situation could be corrected by creating mirror tables. A foreground set of tables
would exist for the production traffic and a background set of tables would exist for the
decision support reporting. A mechanism to periodically migrate the foreground data to
background tables must be established to keep the application data synchronized. One
such mechanism could be a batch job executing UNLOAD and LOAD utilities. This
should be done as often as necessary to sustain the effectiveness of the decision support
processing.
It is important to note that since the access needs of decision support are often
considerably different than the access needs of the production environment, different data
definition decisions such as indexing and clustering may be chosen for the mirror tables.
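
A minimal sketch of such a migration step, assuming a hypothetical production table ACCT_TXN mirrored by a decision-support copy ACCT_TXN_DSS (a simple INSERT ... SELECT is shown in place of the UNLOAD and LOAD utilities mentioned above):

    -- Periodic batch refresh of the background (decision support) mirror
    DELETE FROM ACCT_TXN_DSS;

    INSERT INTO ACCT_TXN_DSS
    SELECT * FROM ACCT_TXN;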

Split Tables
If separate pieces of one normalized table are accessed by different and distinct groups of
users or applications, then consider splitting the table into two (or more) denormalized
tables; one for each distinct processing group. The original table can also be maintained if
other applications exist that access the entire table. In this scenario the split tables should

be handled as a special case of mirror table. If an additional table is not desired then a
view joining the tables could be provided instead.

Tables can be split in one of two ways: vertically or horizontally. Refer to Figure 2. A
vertical split cuts a table column-wise, such that one group of columns is placed into one
new table and the remaining columns are placed in another new table. A horizontal split
cuts a table row-wise. To split a table horizontally, rows are classified into groups via
key ranges. The rows from one key range are placed in one table, those from another key
range are placed in a different table, and so on.
Vertically split tables should be created by placing the primary key columns for the old,
normalized table into both of the split tables. Designate one of the two new tables as the
parent table for the purposes of referential integrity unless the original table still exists. In
this case, the original table should be the parent table in all referential constraints. If this
is the case, and the split tables are read only, do not set up referential integrity (RI) for the
split tables as they are being derived from a referentially intact source. RI would be
redundant.
When a vertical split is being done, always include one row per primary key in each split
table. Do not eliminate rows from either of the two tables for any reason. If rows are
eliminated the update process and any retrieval process that must access data from both
tables will be unnecessarily complicated.
When a horizontal split is being done, try to split the rows between the new tables to
avoid duplicating any one row in each new table. This is done by splitting using the
primary key such that discrete key ranges are placed in separate split tables. Simply
stated, the operation of UNION ALL, when applied to the horizontally split tables, should
not return more rows than are contained in the original, un-split table; likewise, it should
not return fewer rows.
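
A hedged sketch of both kinds of split, assuming a hypothetical normalized EMP table (the column names and the key range are illustrative only):

    -- Vertical split: each split table repeats the full primary key
    CREATE TABLE EMP_PAYROLL
    ( EMPNO     INTEGER NOT NULL PRIMARY KEY
    , SALARY    DECIMAL(12,2)
    , TAX_CODE  CHAR(2)
    );

    CREATE TABLE EMP_HR
    ( EMPNO      INTEGER NOT NULL PRIMARY KEY
    , LASTNAME   VARCHAR(30)
    , HIRE_DATE  DATE
    );

    -- Horizontal split: discrete key ranges, so UNION ALL of the pieces
    -- returns exactly the rows of the original table
    CREATE TABLE EMP_LOW  AS (SELECT * FROM EMP WHERE EMPNO <  500000) WITH DATA;
    CREATE TABLE EMP_HIGH AS (SELECT * FROM EMP WHERE EMPNO >= 500000) WITH DATA;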

Combined Tables

If tables exist with a one-to-one relationship consider combining them into a single
combined table. Sometimes, even one-to-many relationships can be combined into a
single table, but the data update process will be significantly complicated because of the
increase in redundant data.
For example, consider an application with two tables: DEPT (containing department data)
and EMP (containing employee data). The two tables could be combined into a large table
named, for example, EMP_WITH_DEPT. This new table would contain all of the columns of
both tables except for the redundant DEPTNO column (the join criteria). So, in addition
to all of the employee information, all of the department information would also be
contained on each employee row. This will result in many duplicate instances of the
department data. Combined tables of this sort should be considered pre-joined tables and
treated accordingly. Tables with one-to-one relationships should always be analyzed to
determine if combination is useful.
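
A hedged sketch of such a combined table, using the DEPT and EMP example above (Teradata-style CREATE TABLE ... AS syntax; column names other than DEPTNO are hypothetical):

    CREATE TABLE EMP_WITH_DEPT AS
    ( SELECT e.EMPNO
           , e.LASTNAME
           , e.DEPTNO        -- the join column is carried only once
           , d.DEPTNAME      -- department data is repeated on every employee row
      FROM  EMP  e
      JOIN  DEPT d ON d.DEPTNO = e.DEPTNO
    ) WITH DATA;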

Redundant Data
Sometimes one or more columns from one table are accessed whenever data from another
table is accessed. If these columns are accessed frequently with tables other than those in
which they were initially defined, consider carrying them in those other tables as
redundant data. By carrying these additional columns, joins can be eliminated and the
speed of data retrieval will be enhanced. This should only be attempted if the normal
access is debilitating.
Consider, once again, the DEPT and EMP tables. If most of the employee queries require
the name of the employee's department then the department name column could be
carried as redundant data in the EMP table. The column should not be removed from the
DEPT table, though (causing additional update requirements if the department name
changes).
In all cases columns that can potentially be carried as redundant data should be
characterized by the following attributes:

 only a few columns are necessary to support the redundancy


 the columns should be stable, being updated only infrequently
 the columns should be used by either a large number of users or a few very important users
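
A minimal sketch of carrying the department name redundantly in EMP (hypothetical column definitions), together with the extra maintenance it implies:

    -- Carry DEPTNAME redundantly in EMP so most employee queries avoid the join
    ALTER TABLE EMP ADD DEPTNAME VARCHAR(30);

    UPDATE EMP
    SET    DEPTNAME = (SELECT d.DEPTNAME
                       FROM   DEPT d
                       WHERE  d.DEPTNO = EMP.DEPTNO);

    -- Any change to DEPT.DEPTNAME must now also be applied to EMP.DEPTNAME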
Repeating Groups
When repeating groups are normalized they are implemented as distinct rows instead of
distinct columns. This usually results in higher DASD usage and less efficient retrieval
because there are more rows in the table and more rows need to be read in order to satisfy
queries that access the repeating group.
Sometimes, denormalizing the data by storing it in distinct columns can achieve
significant performance gains. However, these gains come at the expense of flexibility.
For example, consider an application that is storing repeating group information in the
normalized table below:
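
For illustration only (the table and column names are hypothetical), the normalized form stores one balance per row:

    CREATE TABLE CUSTOMER_BALANCE
    ( CUSTOMER_ID  INTEGER  NOT NULL
    , BALANCE_NBR  SMALLINT NOT NULL    -- identifies the occurrence within the group
    , BALANCE_AMT  DECIMAL(15,2)
    , PRIMARY KEY (CUSTOMER_ID, BALANCE_NBR)
    );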

This table can store a virtually unlimited number of balances per customer, limited only by
available storage and the storage limits of the RDBMS. If the decision were made to
string the repeating group, BALANCE, out into columns instead of rows, a limit would
need to be set for the number of balances to be carried in each row. An example of this
after denormalization is shown below:
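
A hedged sketch of the denormalized form (again with hypothetical names), where the repeating group is strung out into a fixed set of columns:

    CREATE TABLE CUSTOMER_BALANCE_DN
    ( CUSTOMER_ID  INTEGER NOT NULL PRIMARY KEY
    , BALANCE_1    DECIMAL(15,2)
    , BALANCE_2    DECIMAL(15,2)
    , BALANCE_3    DECIMAL(15,2)
    , BALANCE_4    DECIMAL(15,2)
    , BALANCE_5    DECIMAL(15,2)
    , BALANCE_6    DECIMAL(15,2)    -- hard limit of six balances per customer
    );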

In this example, only six balances may be stored for any one customer. The number six is
not important, but the concept that the number of values is limited is important. This
reduces the flexibility of data storage and should be avoided unless performance needs
dictate otherwise.
Before deciding to implement repeating groups as columns instead of rows be sure that
the following criteria are met:

 the data is rarely or never aggregated, averaged, or compared within the row
 the data occurs in a statistically well-behaved pattern
 the data has a stable number of occurrences
 the data is usually accessed collectively
 the data has a predictable pattern of insertion and deletion
If any of the above criteria are not met, SQL SELECT statements may be difficult to code,
making the data less available due to inherently unsound data modeling practices. This
should be avoided because, in general, data is denormalized only to make it more readily
available.

Derivable Data
If the cost of deriving data using complicated formulae is prohibitive then consider
storing the derived data in a column instead of calculating it. However, when the
underlying values that comprise the calculated value change, it is imperative that the
stored derived data also be changed; otherwise, inconsistent information could be reported.
This will adversely impact the effectiveness and reliability of the database.
Sometimes it is not possible to immediately update derived data elements when the
columns upon which they rely change. This can occur when the tables containing the

derived elements are off-line or being operated upon by a utility. In this situation, time the
update of the derived data such that it occurs immediately when the table is made
available for update. Under no circumstances should outdated derived data be made
available for reporting and inquiry purposes.
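
A minimal sketch, assuming a hypothetical ACCOUNT table in which TOTAL_FEE_AMT is derived from two underlying fee columns:

    -- Store the derived value so queries need not recompute it ...
    ALTER TABLE ACCOUNT ADD TOTAL_FEE_AMT DECIMAL(15,2);

    -- ... but refresh it whenever the underlying columns change
    UPDATE ACCOUNT
    SET    TOTAL_FEE_AMT = BASE_FEE_AMT + PENALTY_FEE_AMT;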

Hierarchies
A hierarchy is a structure that is easy to support using a relational database such as DB2,
but is difficult to retrieve information from efficiently. For this reason, applications which
rely upon hierarchies very often contain denormalized tables to speed data retrieval. Two
examples of these types of systems are the classic Bill of Materials application and a
Departmental Reporting system. A Bill of Materials application typically records
information about parts assemblies in which one part is composed of other parts. A
Department Reporting system typically records the departmental structure of an
organization indicating which departments report to which other departments.
A very effective way to denormalize a hierarchy is to create what are called "speed"
tables. Figure 3 depicts a department hierarchy for a given organization. The hierarchic
tree is built such that the top most node is the entire corporation and each of the other
nodes represents a department at various levels within the corporation. In our example,
department 123456 is the entire corporation. Departments 1234 and 56 report directly to
123456. Departments 12, 3, and 4 report directly to 1234 and indirectly to department
123456. And so on.
The table shown under the tree in Figure 3 is the classic relational implementation of a
hierarchy. There are two department columns, one for the parent and one for the child.
This is an accurately normalized version of this hierarchy containing everything that is
represented in the diagram. The complete hierarchy can be rebuilt with the proper data
retrieval instructions.

Figure 3. Classic Relational Implementation of a Department Hierarchy
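
As a hedged sketch (hypothetical table and column names), the classic parent/child implementation holds one row per immediate reporting relationship:

    CREATE TABLE DEPT_HIER
    ( PARENT_DEPT  VARCHAR(10) NOT NULL    -- e.g. '1234'
    , CHILD_DEPT   VARCHAR(10) NOT NULL    -- e.g. '12'
    , PRIMARY KEY (PARENT_DEPT, CHILD_DEPT)
    );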

Even though the implementation effectively records the entire hierarchy, building a query
to report all of the departments under any other given department can be time-consuming
to code and inefficient to process. Figure 4 shows a sample query that will return all of
the departments that report to the corporate node 123456. However, this query can only
be built if you know in advance the total number of possible levels the hierarchy can
achieve. If there are n levels in the hierarchy then you will need n-1 UNIONs.

Figure 4. Querying the Departmental Hierarchy
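
As a hedged sketch against the DEPT_HIER table outlined above, written for a hierarchy that extends at most three levels below the corporate node:

    -- All departments reporting, directly or indirectly, to department 123456
    SELECT h1.CHILD_DEPT
    FROM   DEPT_HIER h1
    WHERE  h1.PARENT_DEPT = '123456'
    UNION
    SELECT h2.CHILD_DEPT
    FROM   DEPT_HIER h1
    JOIN   DEPT_HIER h2 ON h2.PARENT_DEPT = h1.CHILD_DEPT
    WHERE  h1.PARENT_DEPT = '123456'
    UNION
    SELECT h3.CHILD_DEPT
    FROM   DEPT_HIER h1
    JOIN   DEPT_HIER h2 ON h2.PARENT_DEPT = h1.CHILD_DEPT
    JOIN   DEPT_HIER h3 ON h3.PARENT_DEPT = h2.CHILD_DEPT
    WHERE  h1.PARENT_DEPT = '123456';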

A "speed" table can be built such as the one in Figure 5. This table depicts the parent
department and all of the departments under it regardless of the level. Contrast this to the
previous table which only recorded immediate children for each parent. A "speed" table
also commonly contains other pertinent information that is needed by the given
application. Typical information includes the level within the hierarchy for the given
node, whether or not the given node of the hierarchy is a detail node (at the bottom of the
tree), and, if ordering within level is important, the sequence of the nodes at the given
level.

Figure 5. Speed Table Implementation of a Departmental Hierarchy
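
A hedged sketch of such a speed table (hypothetical names), carrying one row for every ancestor/descendant pair plus the extra attributes mentioned above:

    CREATE TABLE DEPT_SPEED
    ( PARENT_DEPT  VARCHAR(10) NOT NULL    -- any ancestor, not just the immediate parent
    , CHILD_DEPT   VARCHAR(10) NOT NULL    -- any department somewhere under that ancestor
    , CHILD_LEVEL  SMALLINT                -- level of the child within the hierarchy
    , DETAIL_IND   CHAR(1)                 -- 'Y' if the child is a bottom (detail) node
    , LEVEL_SEQ    SMALLINT                -- ordering within the level, where required
    , PRIMARY KEY (PARENT_DEPT, CHILD_DEPT)
    );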

After the "speed" table has been built, speedy queries can be written against this
implementation of a hierarchy. Figure 6 shows various informative queries that would
have been very inefficient to execute against the classical relational hierarchy. These
queries work for any number of levels between the top and bottom of the hierarchy.
A "speed" table is commonly built using a program written in COBOL or another high
level language. SQL alone is usually either too inefficient to handle the creation of a
"speed" table or impractical because the number of levels in the hierarchy is either
unknown or constantly changing.

Figure 6. Querying the Speed Table
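
Hedged examples of the kind of queries that become trivial against such a speed table, whatever the number of levels between the departments involved:

    -- All departments anywhere under the corporate node
    SELECT CHILD_DEPT
    FROM   DEPT_SPEED
    WHERE  PARENT_DEPT = '123456';

    -- Only the bottom-level (detail) departments under department 1234
    SELECT CHILD_DEPT
    FROM   DEPT_SPEED
    WHERE  PARENT_DEPT = '1234'
    AND    DETAIL_IND  = 'Y';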

Types of Denormalization
We have discussed nine different types of denormalization. The table below summarizes
the types of denormalization that are available, with a short description of when each type
is useful.

 Denormalization Type   Use When...
 Pre-joined tables      the cost of a frequently repeated join is prohibitive
 Report tables          specialized, critical reports are needed in an on-line environment
 Mirror tables          tables are required concurrently by different types of processing
 Split tables           distinct groups of users or applications use distinct parts of a table
 Combined tables        tables exist with a one-to-one relationship
 Redundant data         a few stable columns from another table are needed with most queries
 Repeating groups       the repeating group is accessed collectively rather than individually
 Derivable data         the cost of deriving values with complicated formulae is prohibitive
 Speed tables           a hierarchy must be traversed efficiently

Summary
The decision to denormalize should never be made lightly because it involves a lot of
administrative dedication. This dedication takes the form of documenting the
denormalization decisions, ensuring valid data, scheduling of data migration, and keeping
end users informed about the state of the tables. In addition, there is one more category of
administrative overhead: periodic analysis.
Whenever denormalized data exists for an application, the data and environment should be
reviewed on an on-going basis. Hardware, software, and application
requirements will evolve and change. This may alter the need for denormalization. To
verify whether or not denormalization is still a valid decision ask the following questions:
Have the processing requirements changed for the application such that the join criteria,
timing of reports, and/or transaction throughput no longer require denormalized data?
Did a new DBMS release change performance considerations? For example, did the
introduction of a new join method undo the need for pre-joined tables?
Did a new hardware release change performance considerations? For example, does the
upgrade to a new high-end processor reduce CPU consumption such that
denormalization is no longer necessary? Or did the addition of memory provide faster data
access, enabling the data to be physically normalized?
In general, periodically test whether the extra cost related to processing with normalized
tables justifies the benefit of denormalization. You should measure the following criteria:

 I/O saved
 CPU saved
 complexity of update programming
 cost of returning to a normalized design
It is important to remember that denormalization was initially implemented for
performance reasons. If the environment changes, it is only reasonable to re-evaluate the
denormalization decision. Also, it is possible that, given a changing hardware and
software environment, denormalized tables may be causing performance degradation
instead of performance gains.
Simply stated, always monitor and periodically re-evaluate all denormalized applications.

References

This attachment describes physical database design tips when interfacing with the Teradata
CRM product.

