Chapter 4

Chapter 4 - The Extended Logical Data Model

"Too much of a good thing is just right."
Too much of a good thing is just right, but even a little of a bad thing will mess up your data warehouse. We have arrived at building the Extended Logical Data Model (ELDM), which will serve as the input to the Physical Data Model. We are at a critical stage: we must produce excellence in this section because we don't want bad input going into our Physical Data Model. This comes from Computer Science 101 - Garbage In, Garbage Out (GIGO). Building a poor Extended Logical Data Model makes about as much sense as Mae West running east looking for a sunset! Things will be headed in the wrong direction. If, however, you can produce quality in your ELDM, your warehouse is well on its way to being just right! This chapter will begin with the Application Development Life Cycle, and then talk about the Logical Model and Normalization. From there, we get to the meat and discuss the metrics, which are a critical part of the Extended Logical Data Model. The ELDM will become our final input into the Physical Data Model.

The Application Development Life Cycle

"Failure accepts no alibis. Success requires no explanation."
The design of the physical database is key to the success of implementing a data warehouse, along with the applications that reside on the system. Whenever something great is built, whether a building, an automobile, or a business system, there is a process that facilitates its development. When designing the physical database, that process is known as the Application Development Life Cycle. If you follow the process you will need no alibis, because success will be yours. The six major phases of this process are as follows:

- Design - Developing the Logical Data Model (LDM), Extended Logical Data Model (ELDM), and the Physical Data Model (PDM).
- Development - Generating the queries according to the requirements of the business users.
- Test - Measuring the impact that the queries have on system resources.
- Production - Following the plan and moving the procedures into the production environment.
- Deployment - Training the users on the system application and query tools (i.e., SQL, MicroStrategy, Business Objects, etc.).
- Maintenance - Re-examining strategies such as loading data and index selection.

It is important that you understand the fundamentals and the order in which to perform them in the development life cycle of an application. First there is the Business Discovery; second, a Logical Data Model; third, an outstanding Physical Model; fourth, design of the application; and lastly, your Development and Assurance Testing. During the testing phase of an application it is important to check that Teradata is using parallelism, that there are no large spool space peaks, and that AMP utilization is equal. A HOT AMP is a bad sign, and so is running out of spool.

Asking the Right Questions

"He who asks a question may be a fool for five minutes, but he who never asks a question remains a fool forever."
The biggest key to this section is knowledge and not being afraid to ask the right questions. Knowledge about the user environment is vitally important. If you can ask the right questions, you will build a model that maps to the users' needs. In addition, you will be able to deliver a world-class data warehouse that remains cool forever. Here is how this works: the logical modelers will create a logical data model. Then it is up to you to ask the right questions and find out about the demographics of the data. Only then can you build the proper Physical Data Model. Remember: The Logical Data Model will be the input to the Extended Logical Data Model. The Extended Logical Data Model will be the input to the Physical Data Model. The Physical Data Model is where denormalization and the advantage of parallelism are determined.

Logical Data Model

"When you are courting a nice girl an hour seems like a second. When you sit on a red-hot cinder a second seems like an hour. That's relativity."
The first step of the design phase is called the Logical Data Model (LDM). The LDM is a logical representation of tables that reside in the data warehouse database. Tables, rows and columns are the equivalent of files, records and fields in most programming languages. A properly normalized data model allows users to ask a wide variety of questions today, tomorrow, and forever.
The following illustration displays the Employee Table. The columns are emp, dept, lname, fname and sal:


Employee Table

EMP (PK)   DEPT (FK)   LNAME   FNAME   SAL
1          40                          95000.00
2          20                          70000.00
3          20                          55000.00
4          10                          34000.00

Notice that each of the four rows in the Employee Table is listed across all of the columns, and each row has a value for each column. A row is the smallest unit that can be inserted into, or deleted from, a table in a data warehouse database.
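As a minimal sketch, the Employee Table above might be declared in SQL as follows. The table name EMPLOYEE_TABLE, the data types, and the column lengths are illustrative assumptions; only the column names come from the text.

```sql
-- Hypothetical DDL for the Employee Table shown above.
-- Types and lengths are illustrative assumptions.
CREATE TABLE employee_table
  ( emp    INTEGER       NOT NULL   -- Primary Key
  , dept   INTEGER                  -- Foreign Key to a Department Table
  , lname  CHAR(20)
  , fname  VARCHAR(20)
  , sal    DECIMAL(10,2)
  , PRIMARY KEY (emp)
  );

-- Each INSERT adds exactly one row, the smallest unit of change:
INSERT INTO employee_table (emp, dept, sal) VALUES (1, 40, 95000.00);
```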

Primary Keys

"Instead of giving politicians the keys to the city, it might be better to change the locks."
The Primary Key of a table is the column or group of columns whose values will identify each row of that table.

Every table has to have a primary key: tables are very flexible when it comes to defining how a table's data can be laid out; however, every table must have a primary key, because each row within that table must always be uniquely identifiable.

Every table can only have one primary key: if the table happens to have several possible combinations of columns that could work as a primary key, only one can be chosen. You cannot have more than one primary key on a table. The smallest group of columns, often just one, is usually the best.

PK means primary key: primary keys will be marked with the letters PK.

Foreign Keys

"Life is a foreign language. All men mispronounce it."
A foreign key is a column or group of columns that happens to be a primary key in another table. FK will stand for your foreign keys. Foreign keys help to relate a group of rows to another group of rows, and these groups of rows can be found on the same or multiple table(s). Both groups of rows are required to have a common column containing like-data so that they can match up with each other. FK means foreign key: when drawing out the design of your tables, foreign keys will be marked with the letters FK. They can also be numbered so that you can properly mark multi-column foreign keys (FK1, FK2, etc.).

Primary Key Foreign Keys Establish a Relationship

"One of the keys to happiness is bad memory."
Primary Key Foreign Key relationships establish a relation; therefore the tables can be joined. The picture below shows the joining of the Employee Table and the Department Table.
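As a sketch of that join, assuming a hypothetical DEPARTMENT_TABLE whose primary key is DEPT (the DEPT_NAME column is also an assumption):

```sql
-- Joining the Employee Table to the Department Table on the
-- Primary Key / Foreign Key columns. department_table and its
-- dept_name column are assumed names for illustration.
SELECT e.emp
     , e.lname
     , d.dept
     , d.dept_name
FROM   employee_table   e
JOIN   department_table d
ON     e.dept = d.dept;   -- FK in Employee matches PK in Department
```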

Normalization

"The noblest search is the search for excellence."
The term normalizing a database came from IBM's Codd. He was intrigued by Nixon's term of trying to normalize relations with China, so he began his search for excellence to normalize relations between tables. He called the process normalization. Normalizing a database is a process of placing columns that are not key related (PK/FK) columns into tables in such a way that flexibility is utilized, redundancy is minimized, consistency is maintained, and update anomalies are vaporized!

Industry experts consider Third Normal Form (3NF) mandatory; however, it is not an absolute necessity when utilizing a Teradata Data Warehouse. Teradata has also been proven to be extremely successful with Second Normal Form implementations. Most modelers believe the noblest search is the search for perfection, so they can often take years to perfect the data model. This is a mistake. When creating a logical data model, don't strive for perfection; aim for excellence. And do it quickly!

The interesting thing about a normalized model is that each normal form builds on the levels before it. The first three normal forms are described as follows:

1. First Normal Form (1NF) eliminates repeating groups: for each Primary Key there is only one occurrence of the column data. In first normal form there are no repeating groups.

2. Second Normal Form (2NF) is when the columns/attributes relate to the entire Primary Key, not just a portion of it. All tables with a single-column Primary Key are considered second normal form. When a table is normalized from first to second normal form, it will have no repeating groups and attributes must relate to the entire primary key.

3. Third Normal Form (3NF) states that all columns/attributes relate to the entire primary key and no other key or column. In third normal form a table will have no repeating groups (like first normal form), attributes must relate to the entire primary key (like second normal form), and attributes must relate to the primary key and not to each other (like third normal form).
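A small sketch of moving a table into third normal form. The scenario (an EMPLOYEE_TABLE carrying a redundant DEPT_NAME column) and all names are assumptions for illustration:

```sql
-- Before: dept_name depends on dept, not on the primary key emp,
-- so the employee table violates third normal form.
-- After: dept_name moves to its own Department Table keyed by dept.
CREATE TABLE department_table
  ( dept       INTEGER  NOT NULL
  , dept_name  VARCHAR(30)
  , PRIMARY KEY (dept)
  );

-- employee_table now keeps only dept as a foreign key;
-- the department name is stored exactly once.
```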

Entities

"Examine what is said, not him who speaks."
An entity represents a person, place or thing, or anything for that matter that can be uniquely identified. We mentioned earlier in the book that a noun is, in simple terms, the name of a type of person, place or thing. Entities are nouns that we want to track. We will want to track our favorite nouns such as employees, customers, departments, locations, products and stores. A great building starts with a great foundation, and a great table starts by tracking a great entity. If a business can name a noun that it wants to keep track of, then this is the start of a great table.

Once the tables are defined, related, and normalized, the Logical Data Model is complete. The next phase involves the business users, who establish the foundation for creation of the Extended Logical Data Model. Read on as we discuss this next step of the design process of Teradata.

Entities of Operational Systems Vs Decision Support

"It's time for the human race to enter the Solar System."
A data model of an operational system will often be different than the data model of a data warehouse. Operational systems must track everything that the business is doing, but data warehouses usually start small, and once the Return On Investment (ROI) is realized they continue to grow. The approach to modeling is the same for operational systems as for a data warehouse's Decision Support System (DSS), but the difference lies in the fact that in our DSS warehouse we want to track what will give us the most bang for the buck, and once we make money off this we will expand our tracking. Management understands that money doesn't grow on trees, and you may only get one chance to fund a data warehouse project. Figuring out what will bring the best ROI as you start the warehouse is extremely important. When trying to identify an entity, the best thing to do first is to establish whether it's a major entity or a minor entity.

Examples of Major and Minor Entities

Relations

"Fate chooses your relations; you choose your friends."
A relation can be a state of being, an action, an association, or an event that will tie two or more entities together. Relations come in three different forms: one-to-one, one-to-many and many-to-many. No matter what the form of the relation is, the tables being related will have a Primary Key Foreign Key relationship. The term Normalization in a relational database actually comes from Dr. Codd of IBM. He was inspired by President Richard Nixon, who at the time was trying to build a positive relationship with China. President Nixon termed what he was doing as normalizing relations with China.

"When they discover the center of the universe, a lot of people will be disappointed to discover they are not in it."
A one-to-one relation is found when each occurrence of one entity (entity A) can be related to only one occurrence of another entity (entity B). The same applies when you relate entity B to entity A. This is the rarest form of relations, and may not be found in most data models you create. Because there is only one China and only one United States of America, their normalizing of relations could be considered a one-to-one relationship.

"One's dignity may be assaulted, vandalized, and cruelly mocked, but it cannot be taken away unless it is surrendered."
A one-to-many relation is found when each occurrence of one entity (entity A) can be related to only one occurrence of another entity (entity B), but when it comes to each occurrence of entity B, you can find multiple occurrences relating to entity A, or none at all. For example, you can only have 1 department assigned to an employee, but you can have multiple employees assigned to a department. This is a very common relation found in tables, and will appear in almost all models.

"Cats are smarter than dogs. You can't get eight cats to pull a sled through snow."
A many-to-many relation is found when each occurrence of one entity (entity A) can be related to many different occurrences of another entity (entity B). The same will apply when you relate entity B to entity A. This is also a very common form of a relation.

"A lot of people approach risk as if it's the enemy when it's really fortune's accomplice."
Because a Many-to-Many relationship does not have a direct Primary Key Foreign Key relationship, an associative table is utilized as the middle man. The associative table has a multi-column Primary Key. One of the associative table's Primary Key columns is the Primary Key of table A, and the other Primary Key column of the associative table is the Primary Key of table B.

"The surprising thing about young fools is how many survive to become old fools."
As you can see from our example below, we were able to join our Many-to-Many relationship tables via the associative table, the Student Course Table. The example below shows the exact syntax for joining the Student Table to the Course Table via the associative table.
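The original join example was a picture; the sketch below reconstructs the idea under assumed table and column names (STUDENT_TABLE, COURSE_TABLE, and a STUDENT_COURSE_TABLE whose two-column primary key carries both foreign keys):

```sql
-- Resolving the many-to-many between students and courses
-- through the associative Student Course Table.
-- All table and column names are illustrative assumptions.
SELECT s.student_id
     , s.last_name
     , c.course_id
     , c.course_name
FROM   student_table        s
JOIN   student_course_table sc
ON     s.student_id = sc.student_id    -- PK of table A
JOIN   course_table         c
ON     c.course_id  = sc.course_id;    -- PK of table B
```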

Attributes

"If computers get too powerful, we can organize them into a committee - that will do them in."

"This is the sixth book I've written, which isn't bad for a guy who's only read two."
An attribute is a characteristic of an entity or a relation, describing its character, amount or extent.

"He's the kind of guy who lights up a room just by flicking a switch."
A subset of an entity is a group of occurrences of an entity that have a common interest or quality that separates the group from the rest of the occurrences of that entity. We call the entity as a whole the 'Superset'. A dependent of an entity is a noun, and it could have attributes of its very own; however, it only exists as part of some other entity. This entity is called the parent of the dependent.

"To me, old age is always 15 years older than I am."
In this section we're going to learn about special cases of relations, the first of which are called recursive relations. A recursive relation is a relation between different occurrences of the same entity. This also includes subsets and dependents of entities.

"My idea of an agreeable person is a person who agrees with me."
The second type of special case is called complex relations. A complex relation is a relation that's shared between more than two entities, which is why it can be complex. This also extends to relations to subsets of entities and dependents of entities.

"They always say time changes things, but you actually have to change them yourself."
The third scenario of special relations is the time relation. This type of scenario is a relation between an entity, a subset, or a dependent, and a time value.

A Normalized Data Warehouse

"The reputation of a thousand years may be determined by the conduct of one hour."
A normalized data warehouse will have many different tables that are related to one another. Most normalized databases will have many tables with fewer columns. Each table created will have a Primary Key, and all of the columns in the table should relate to the Primary Key. A Foreign Key is another column in the table that is also a Primary Key in another table. Tables that have a relation can be joined; the Primary Key Foreign Key relation is how joins are performed on two tables. This provides flexibility and is a natural way for business users to view the business.

"Never insult an alligator until after you have crossed the river."
Never insult a modeler until after you have the ERWin diagram in hand. Relational databases use the Primary Key Foreign Key relationships to join tables together. Dimensional Modeling often implements fewer tables and can be adapted to enhance performance. Dimensional Modeling was originally designed for retail supermarket applications because their systems did not have the performance to perform joins, full table scans, aggregations, and sorting. This is because dimensional modeling was designed around answering specific questions. Many believe that a normalized model is best, while others argue that dimensional models are better. A combination of both is an excellent strategy.

"Facts are stupid things."
The Dimension Table - The dimension table helps to describe the measurements on the fact table. The dimension table can and will contain many columns and/or attributes. Dimension tables tend to be relatively shallow when it comes to the average number of rows within a dimension table. Each dimension is defined by its primary key, which serves as the basis for referential integrity with any fact table to which it is joined. Dimension Table attributes are key to making a data warehouse usable and understandable. Dimension attributes serve as the source of query constraints, groupings, and report labels. Dimensions also implement the user interface to the data warehouse. When a query or report is requested, attributes can be identified as the words following 'by'. For example, a user may ask to see last week's total sales by Product_Id and by Customer_Number. Product_Id and Customer_Number will have to be available as dimension attributes.
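The "words following by" idea translates directly into a GROUP BY. A sketch, assuming a hypothetical SALES_FACT table with SALE_DATE and SALE_AMOUNT columns alongside the two dimension attributes named in the text (the date range is also illustrative):

```sql
-- "Last week's total sales by Product_Id and by Customer_Number."
-- sales_fact, sale_date, and sale_amount are assumed names.
SELECT   product_id
       , customer_number
       , SUM(sale_amount) AS total_sales
FROM     sales_fact
WHERE    sale_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-07'
GROUP BY product_id
       , customer_number;
```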

The best attributes tend to be textual and discrete. It would be wise to label your attributes as real words instead of abbreviations. A typical dimensional attribute will include a short description, long description, brand name, category name, packaging type, packaging size, and several other product characteristics. Dimensional tables contain rich sets of values, columns, and dimensional attributes that give users the tools to cut and analyze their data in every which way. Remember, the role of a dimensional table is to identify and represent hierarchical relationships within a business.

Not all numeric data has to be put into a dimensional table. For example, the price of certain products is always changing - gas will work. Because the standard cost of gasoline is continuously changing, it's possible to consider the price of gasoline not as a constant attribute, but rather as a measured fact. If a case like that arises, it's possible that the data can go into either the fact table or the dimensional table, depending on how the designer is developing his dimensional model.

"Just because something doesn't do what you planned it to do doesn't mean it's useless."
Thomas Edison wasn't a big fan of Dimensional Modeling. You just can't string together a few tables and hope the system's fast as lightning; a battle plan is needed. Fortunately for us, we have the four-step Dimensional Modeling Process, which enables us to slowly, yet accurately, map out our model. The steps go as follows:

1) Select the business process to model: A process is any business action done by a company, no matter how minute or large, that is recorded by their data-collection system. If you're having trouble thinking of actions your company is doing, simply ask around. People whose job depends on being able to analyze the data will know exactly what the key business processes are. Other forms of database modeling typically focus on process by department. If we create our model by splitting up processes by department, rather than processes by business, there will be many areas containing data duplication using different labels. This is not good. Dimensional modeling tries to limit duplicate data as much as possible. Only publish data once. This will reduce ETL development, and will reduce overhead on data management and disk storage.

2) Define the grain of said business process: Defining the grain means that you're listing exactly what each fact table row represents. A common mistake amongst a team of database developers is that they will not agree on table granularity. If grain is improperly defined, it will become a major problem in the future. If you find in the next two steps that you've improperly determined the grain, that's ok! It's happened to everyone (yeah, I'm looking at you!). Just go back to step two, come up with a better grain, and go from there.

3) Pick any dimension that applies to each fact table row: It's imperative that our fact tables are full of a set of dimensions that represent all possible descriptions that take on single values within the context of each measurement. Typical dimensions are date, customer, and employee. For each dimension, every attribute must be listed. This will help to convey any level of detail linked to fact table measurements.
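A declared grain can be read straight off a fact table's primary key. A sketch under assumed names: one row in SALES_FACT per product, per customer, per day.

```sql
-- Grain: one row per product, per customer, per day.
-- All table and column names here are illustrative assumptions.
CREATE TABLE sales_fact
  ( sale_date        DATE           NOT NULL  -- date dimension
  , product_id       INTEGER        NOT NULL  -- product dimension
  , customer_number  INTEGER        NOT NULL  -- customer dimension
  , sale_amount      DECIMAL(10,2)            -- measured fact
  , PRIMARY KEY (sale_date, product_id, customer_number)
  );
```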

"The weak can never forgive. Forgiveness is the attribute of the strong."
Each dimension on a dimensional model is going to have attributes that make that table unique from the rest. Each table will vary, depending on what information is on which table. Attributes will look something like the following:

"People can have the Model T in any color - so long as it's black."
The following two pictures represent an Entity-Relational (ER) Model and a Dimensional Model (DM):

"The man who is swimming against the stream knows the strength of it."
Dimensional Modeling contains a number of warehousing advantages that the ER model lacks. First off, the dimensional model is very predictable. Query tools and report writers are able to make strong assumptions about the dimensional model to make user interfaces more understandable, and to make processing more efficient. Because the framework is predictable, it strengthens processing. The database engine is able to make strong assumptions about constraining the dimension tables and then linking to the fact table all at once, along with the Cartesian product of the dimension table keys that satisfy the user's constraints. This approach enables the user to evaluate arbitrary n-way joins to a fact table within a single pass through the fact table's index. Metadata is able to use the cardinality of values within a dimension to help guide user-interface behavior.

Another strength is that there are several standard approaches for handling certain modeling situations. Each situation has a well-understood set of alternative decisions that can be programmed in report writers, query tools, and user interfaces. These situations can include:

- Slow-changing dimensions, involving dimension tables that change slowly over time. Dimensional modeling helps to provide techniques for handling these slow-changing dimensions.
- Miscellaneous products, where a business needs to track a number of different lines of business within a single common set of attributes and facts.
- Pay-in-advance databases, where the transactions of the company are more than small revenue accounts; the business may want to look at single transactions as well as a regular report of revenue.

The last strength of the dimensional model is the growing pool of DBA utilities and software processes that regulate and use aggregates. Remember, an aggregate is summary data that is logically redundant within the data warehouse and is used to enhance query performance. A well-formed strategy for comprehensive aggregates is needed for any medium to large data warehouse implementation. Another way to look at it is that if you don't have any aggregates, lots of money could end up being wasted on hardware upgrades. Aggregate management software packages and navigation utilities depend on a specific single structure of the fact and dimension tables, which in turn is dependent on the dimensional model. If you stick to ER modeling, you will not benefit from these tools.

"The problem with political jokes is that they get elected."
However, this doesn't mean that the dimensional model is perfect. There are several situations where an ER model is better than the dimensional model. Star-join schemas are shaped around user requirements, and because a) not every user can be interviewed during the dimensional model design phase, and b) user requirements vary from group to group, it's possible that a star-join schema doesn't fit all users. A star-join schema tends to fit the needs of the users who have been interviewed, but there will always be users who don't contribute to the dimensional modeling process. A star-join schema with just one shape and a set number of constraints can be extremely optimal for one group of users, and horrendous for another. The star-join design tends to optimize the access of data for solely one group of users at the expense of everyone else.

"A problem well stated is a problem half-solved."
Each department of a company tends to conduct their business differently than other departments. Each department tends to care for different aspects of the company, which is why each department sees things differently from others. Accounting users look at things in terms of quarterly revenue, while Finance users look at things in terms of monthly and yearly revenue. Because of the need for each department to view things in their own unique way, each department requires a different star-join schema for their data warehouse. A star-join schema that is optimal for the finance department is practically useless for marketing, and vice versa. A single star-join schema will never fit everyone's requirements in a data warehouse. The data warehousing industry has long since discovered that a single database will not work for all purposes. There are several reasons why departments need their own star-join schemas and can't share a star-join schema with another department:

- Sequencing of data. Finance users love to see data sequenced one way while marketing users love to see it sequenced another way.
- Data definitions. The sales department considers a sale as closed business, while the accounting department sees it as booked business.
- Granularity.
- Products. Sales tends to look at things in terms of future products, while the finance department tends to look at things in terms of existing products.
- Geography. The sales department looks at things in terms of ZIP code, while the marketing department might look at things in terms of states.
- Time. Finance looks at the world through calendar dates, while accounting looks at the world through closing dates.
- Sources of data. A source system will feed one star join while another source system feeds another.

The differences between certain business operations and others are much more vast than the short list above. Because there are many types of users on a database, many situations can arise at any time, and on any subject. It is possible to design a star-join schema specifically for each department, but even then certain problems will arise. Most of the problems don't even become apparent until multiple star joins have been designed for the data warehouse. When you have multiple independent star-join schema environments, the same detailed data will appear on each star join. Because of this:

- Every star join will contain much of the same data as the others. Star joins can become unnecessarily large when every star join thinks that it has to go back and gather the same data that the other star joins have already collected.
- The results from each star join will be inconsistent with the results of every other star join. Data will no longer be reconcilable, and the ability to correctly distinguish the right data from the wrong data will be nearly impossible.
- New star joins will require just as much work to create as the previous star joins already in the data warehouse. When each star join is built independently, new star joins will be built on a data warehouse with no foundation, and any new star-join creation will require the same amount of work as the old star joins. The interface that supports the applications that feed star joins will become unmanageable.
- It will never become apparent that there is a problem with a star join when you're looking at just one star join. But when a database contains multiple star joins, it becomes apparent that the dimensional model has many limitations.

In other words, dimensional modeling as a basis for a data warehouse can lead to many problems when multiple star joins are involved. In the end, dimensional modeling is great for pulling information out of data, while 3NF is great for the quick retrieval of that data. Dimensional and 3NF modeling work well together and both have their benefits. Most warehouses today take advantage of both and will find ways to implement both theories into their database.

The Extended Logical Data Model

"Choice, not chance, determines destiny. Leave nothing to chance."

"The only true wisdom is in knowing you know nothing."
The information you gather to create the Extended Logical Data Model will provide the main source of input, as well as the backbone, for the choices you will make in the Physical Data Model. You are in the process of determining the final outline of your data warehouse. The completed Logical Data Model is used as a basis for the development of the Extended Logical Data Model (ELDM). The ELDM includes information regarding physical column access and data demographics for each table; in addition, it maps transactions and applications to their prospective related objects. This also serves as input to the implementation of the Physical Data Model.

The successful completion of the ELDM depends heavily on input from the business users. Users are very knowledgeable about the questions that need to be answered today, the transactions that will occur, and the rate at which these transactions occur. This knowledge in turn brings definition on how the warehouse will be utilized, sized, and even accessed, and this input provides clues about the queries that will be used to access the warehouse. This information, along with the frequency at which a column will appear in the WHERE clause, is extremely important. The two major components of an ELDM are the following:

- COLUMN ACCESS in the WHERE Clause or JOIN Clause
- DATA DEMOGRAPHICS

Inside the ELDM, each column will have information concerning expected usage and demographic characteristics. How do we find this information? The best way is by running queries that use aggregate functions. The SAMPLE function can also be helpful when collecting data demographics.
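For example, aggregate queries like the following can collect the demographics. They are written against the hypothetical employee_table used earlier in this chapter; the SAMPLE clause is Teradata syntax, shown here as a sketch.

```sql
-- How many distinct values does DEPT have, out of how many rows?
SELECT COUNT(DISTINCT dept) AS distinct_values
     , COUNT(*)             AS total_rows
FROM   employee_table;

-- Maximum rows per value: how big is the most popular value?
SELECT   dept
       , COUNT(*) AS rows_per_value
FROM     employee_table
GROUP BY dept
ORDER BY rows_per_value DESC;

-- Teradata's SAMPLE clause can pull a subset for quick estimates:
SELECT *
FROM   employee_table SAMPLE 100;   -- 100 sampled rows
```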

Follow Socrates' advice and assume you know nothing when starting this portion of the design process. If you think you already have all the answers, then Socrates had you in mind on the last three words of his quote. Muhammad Ali has also given you some advice that is reflective of your next objective: "Don't count the days, make the days count." Count the DATA so you can make the DATA count. Understand how columns are accessed and how tables are joined, and your warehouse is on its way to providing true wisdom!

It will be your job to interview users, look at applications, and find out what columns are being used in the WHERE clause for the SELECTs. It is also important to know what columns are joining the tables together. The key here is to investigate what tables are being joined. You will be able to find some join information from the users, but common sense plays a big part in join decisions.

COLUMN ACCESS in the WHERE CLAUSE:

Value Access Frequency - How frequently the table will be accessed via this column.
Value Access Rows - The number of rows that will be accessed multiplied by the frequency at which the column will be used.
Join Access Frequency - How frequently the table will be joined to another table by this column being used in the WHERE clause.
Join Access Rows - The number of rows joined.

Remember this golden rule! The Primary Index is the fastest way to access data, and Secondary Indexes are next. Quite often, new designers to Teradata believe that selecting the Primary Index will be easy. They just pick the column that will provide the best data distribution. They assume that if they keep the Primary Index the same as the Primary Key column (which is unique by definition), then the data will distribute evenly (which is true). However, the Primary Index is about more than distribution; even more important is Join Access. If you make a column your Primary Index that is never accessed in the WHERE clause or used to JOIN to other tables, you have given away the fastest access path in the system.

So how do you accomplish this task? The following will be your guide:
- Write SQL and utilize a wide variety of tools to get the data demographics.
- Combine these demographics with the column and join access information to complete the Extended Logical Data Model.
- Then use this information to create the Primary Indexes, Secondary Indexes, and other options in the Physical Database Design.
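The first step in the guide above is to write SQL that gathers the data demographics. Here is a minimal sketch of that idea, using Python with SQLite for portability; on a real system you would run the equivalent SQL directly against Teradata, and the Employee table and its contents here are invented for illustration.

```python
import sqlite3

# Hypothetical Employee table; SQLite stands in for Teradata here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employee (emp INTEGER, dept INTEGER)")
con.executemany("INSERT INTO Employee VALUES (?, ?)",
                [(1, 40), (2, 20), (3, 20), (4, 40), (5, 10), (6, None)])

def demographics(con, table, column):
    """Collect the ELDM data demographics for one column."""
    distinct, nulls = con.execute(
        f"SELECT COUNT(DISTINCT {column}), "
        f"       SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}").fetchone()
    # Rows per distinct value, to find the maximum and typical counts.
    per_value = [n for (n,) in con.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NOT NULL "
        f"GROUP BY {column}")]
    return {"distinct_values": distinct,
            "max_rows_per_value": max(per_value),
            "typical_rows_per_value": sorted(per_value)[len(per_value) // 2],
            "max_rows_null": nulls}

print(demographics(con, "Employee", "dept"))
# {'distinct_values': 3, 'max_rows_per_value': 2, 'typical_rows_per_value': 2, 'max_rows_null': 1}
```

The "typical" figure here is taken as the median rows per value; in practice you would decide what typical means for your own data.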

DATA DEMOGRAPHICS:

Distinct Values - The total number of unique values that will be stored in this column.
Maximum Rows per Value - The number of rows that will have the most popular value in this column.
Typical Rows per Value - The typical number of rows for each column value.
Maximum Rows NULL - The number of rows with NULL values for the column.
Change Rating - A relative rating for how often the column value will change. The value range is from 0-9, with 0 describing columns that do not change and 9 describing columns that change with every write operation.

Data Demographics answer these questions:
- How evenly will the data be spread across the AMPs?
- Will my data have spikes causing AMP space problems?
- Will the column change too often to be an index?

The reason for the Extended Logical Data Model is to provide input for the Physical Model so that the Parsing Engine (PE) Optimizer can best choose the least costly access method or join path for user queries. The optimizer will look at factors such as row selection criteria and index and column demographics to make the best index choices.

During the first pass at a table you should pick your potential Primary Indexes. Label them UPI or NUPI based on whether or not the column is unique. At this point in time, don't look at the change rating. A great Primary Index will have:
- A Value Access frequency that is high
- A Join Access frequency that is high
- Reasonable distribution
- A change rating below 2

Extended Logical Data Model Template

The example below illustrates an ELDM template of the Employee Table (assuming 20,000 Employees). The Value Access and Data Demographics have been collected. We can now use this to pick our Primary Indexes and Secondary Indexes.

EMP Table (20,000 rows)

                         EMP        DEPT      LNAME    FNAME    SAL
                         (PK, SA)   (FK)
ACCESS
Value Acc Freq           6K         5K        100      0        0
Join Acc Freq            7K         6K        0        0        0
Value Acc Rows           70K        50K       0        0        0
DATA DEMOGRAPHICS
Distinct Rows            20K        5K        12K      N/A      N/A
Max Rows Per Value       1          50        1K       N/A      N/A
Max Rows Null            0          12        0        N/A      N/A
Typical Rows Per Value   1          15        3        N/A      N/A
Change Rating            0          2         1        N/A      N/A

Once we have our table templates we are ready for the Physical Database Design. Read on; this is becoming interesting!

The Physical Data Model

"Nothing can stand against persistence; even a mountain will be worn down over time."

We have arrived at the moment of truth. We are arriving at the top of the mountain. It is now time to create the Physical Database Design model. The biggest keys to a good physical model are choosing the correct:
- Primary Indexes
- Secondary Indexes
- Denormalization Tables
- Derived Data Options

The physical model is important because that is the piece that makes Teradata perform at the optimum level on a daily basis. If you have done a great job with the physical model, Teradata should perform like lightning. If you have done the job on the physical model and Teradata is still not performing to your anticipated speed, then you might want to get an upgrade. You don't justify an UPGRADE because of a slow Year-End or Quarter-End report! You justify an upgrade if you have done due diligence on the physical model and have reached the point where your system is not performing well on a daily basis. Teradata needs to be designed to perform best on a daily basis.

No two minds are alike, but two tables are usually joined in exactly the same manner by everybody. This is why the most important factor when picking a Primary Index is Join Access Frequency. If you have a column with high Join Access Frequency, then this is your Primary Index of choice. You still need to look at whether the index will distribute well: if the distribution for a column is BAD, it has already been eliminated as a Primary Index candidate, because with poor distribution you are at risk of running out of Perm or Spool. If there are no join columns that survived the Distribution Analysis, then you must look at Value Access Frequency to see if the column is accessed frequently. If it is, then it is a great candidate for a primary or secondary index, so pick the column with the best Value Access as your Primary Index. Lastly, if all of the above fail or two columns are equally important, then pick the column with the best distribution as your Primary Index. And if a column with a high Value Access Frequency does not become the Primary Index, you can always create a secondary index for it.

Before you go crazy denormalizing, remember these valid considerations for optimal system performance first, or you are wasting your time. Make sure you have chosen your Primary Indexes based on user ACCESS and on UNIQUENESS for distribution purposes, and that your Primary Indexes are stable values (low Change Rating). Make sure statistics will be collected properly. Also, know your environment and your business priorities. For example, most often the performance benefits of secondary indexes in OLTP environments outweigh the performance costs of batch maintenance.

Denormalization

"Most databases denormalize because they have to, but Teradata denormalizes because it wants to."

Denormalization is the process of implementing methods to improve SQL query performance. Improved performance is an admirable goal; however, one must be aware of the hazards of denormalization. Denormalization will always reduce the amount of flexibility that you have with your data and can also complicate the development effort. It will also increase the risk of data anomalies, and it can take on extra I/O and space overhead in some cases. Others believe that denormalization has a positive effect on application coding, because some feel it will reduce the potential for data problems. It is in the physical model that you can determine the places to denormalize for extra speed. The biggest keys to consider when deciding to denormalize a table are PERFORMANCE and VOLATILITY. Will performance improve significantly, and does the volatility factor make denormalization worthwhile?
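The Primary Index decision rules above can be sketched as a small function. This is not from the book: the function name and field names are invented, and the `good_distribution` flag stands in for a real distribution analysis, so treat it only as one way to encode the rules.

```python
def pick_primary_index(columns):
    """columns: dict of column name -> dict with join_freq, value_freq,
    good_distribution (bool), and distinct_values from the ELDM."""
    # Bad distribution eliminates a column before anything else is weighed.
    usable = {c: d for c, d in columns.items() if d["good_distribution"]}
    joined = {c: d for c, d in usable.items() if d["join_freq"] > 0}
    if joined:  # rule 1: highest Join Access Frequency wins
        return max(joined, key=lambda c: joined[c]["join_freq"])
    accessed = {c: d for c, d in usable.items() if d["value_freq"] > 0}
    if accessed:  # rule 2: fall back to Value Access Frequency
        return max(accessed, key=lambda c: accessed[c]["value_freq"])
    # rule 3: otherwise take the best distribution (most distinct values)
    return max(usable, key=lambda c: usable[c]["distinct_values"])

# Figures loosely based on the Employee ELDM template above.
eldm = {
    "EMP":   {"join_freq": 7000, "value_freq": 6000,
              "good_distribution": True, "distinct_values": 20000},
    "DEPT":  {"join_freq": 6000, "value_freq": 5000,
              "good_distribution": True, "distinct_values": 5000},
    "LNAME": {"join_freq": 0, "value_freq": 100,
              "good_distribution": True, "distinct_values": 12000},
}
print(pick_primary_index(eldm))  # EMP: highest Join Access Frequency
```

In real designs the thresholds are judgment calls made from the full ELDM, not hard cutoffs like these.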

Either way, you should consider denormalization if users run certain queries over and over again and speed is a necessity. It is a great idea whenever you denormalize from your logical model to include the denormalization in "The Denormalization Exception Report". This report keeps track of all deviations from 3rd normal form in your data warehouse.

Derived Data

Derived data is data that is calculated from other data. For instance, taking all of the employees' salaries and averaging them would calculate the Average Employee Salary. It is important to be able to determine whether it is better to calculate derived data on demand or to place this information into a summary table. The 4 key factors for deciding whether to calculate or store stand-alone derived data are:
- Response Time Requirements
- Access Frequency of the request
- Volatility of the column
- Complexity of the calculation

Response Time Requirements - Derived data can take a period of time to calculate while a query is running. If the calculation takes a long time and you don't have the time to wait, then you might consider placing it in a table.

Access Frequency of the request - If one user needs the data occasionally, then calculate it on demand. If there are several requests for derived data, then consider denormalizing so many users can be satisfied.

Volatility of the column - If the data changes often, a stored answer goes stale quickly. If the data never changes and you can run the query one time and store the answer for a long period of time, then you may want to consider denormalizing.

Complexity of the calculation - The more complex the calculation, the longer the request may take to process. If user requirements need speed and their requests are taking too long, then you might consider denormalizing to speed up the request.

When you look at the above considerations you begin to see a clear picture. The key word here is performance, and performance for known queries is the most complete answer. If there is no need for speed, then there is no reason to store the data in another table or temporary table. But if there are many users requesting the information daily, and the data is relatively stable, then denormalize and make sure that when any additional requests are made the answer is ready to go. If the game stays the same there is no need to be formal - make it denormal. If the data changes often, then be formal and stay with normal.
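The calculate-versus-store decision can be seen in miniature with the chapter's Average Employee Salary example. This is only a sketch: SQLite stands in for Teradata, and the table and its rows are invented sample data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employee (emp INTEGER, dept INTEGER, sal REAL)")
con.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                [(1, 40, 65000), (2, 20, 70000), (3, 20, 55000),
                 (4, 40, 30000), (5, 10, 40000), (6, 30, 20000)])

# Calculate on demand: fine for one user who asks occasionally.
(avg_sal,) = con.execute("SELECT AVG(sal) FROM Employee").fetchone()

# Store the derived answer: worthwhile when many users ask daily and
# the underlying salaries are relatively stable.
con.execute("CREATE TABLE Avg_Salary AS "
            "SELECT AVG(sal) AS avg_sal FROM Employee")
(stored,) = con.execute("SELECT avg_sal FROM Avg_Salary").fetchone()
print(avg_sal, stored)  # both are about 46666.67
```

The stored copy answers instantly but must be refreshed whenever a salary changes, which is exactly the volatility trade-off described above.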

Temporary Tables

Setting up Derived, Volatile, or Global Temporary tables allows users to use a temporary table during their entire session. A great example might be this: let's say you have a table that tracks detail data for an entire year and holds 120,000,000 rows. Yes, the number is 120 million rows. You have been asked to run calculations on a per-month basis. You can create a temporary table, insert only the month you need to calculate, and run queries until you log off the session. Your queries in theory will run twelve times faster, because they read one month of data instead of twelve. After you log off, the data in the temporary table goes away. This is a technique where everyone wins.

TABLE 1 - Employee Table

EMP    DEPT   LNAME      FNAME     SAL
(PK)   (FK)
1      40     BROWN      CHRIS     65000.00
2      20     JONES      JEFF      70000.00
3      20     NGUYEN     XING      55000.00
4      40     JACOBS     SHERRY    30000.00
5      10     SIMPSON    MORGAN    40000.00
6      30     HAMILTON   LUCY      20000.00

TABLE 2 - Department Table

DEPT   DEPT_NAME
(PK)
10     Human Resources
20     Sales
30     Finance
40     Information Technology

TABLE 3 - Dept_Salary Table (Temporary Table)

Avg_Sale DECIMAL(7. a volatile table may be utilized multiple times.00 Count_Sal 1 2 1 2 Avg(Sal) 40000 75000 20000 47500 Volatile Temporary Tables Volatile tables have multiple characteristics in common with derived tables. It is restricted to a single query statement at a time. They are materialized in spool and are unknown to the Data Dictionary. LOG ( Sale_Date DATE . This feature allows for additional queries to utilize the same rows in the temporary table without requiring the rows to be rebuilt.2) . However.2) .00 125000.2) ) ON COMMIT PRESERVE ROWS . An example of how to create a volatile table would be as follows: CREATE VOLATILE TABLE Sales_Report_vt.Min_Sale DECIMAL(7.Sum_Sale DECIMAL(9. and in more than one SQL statement throughout the life of a session. Now that the Volatile Table has been created.Max_Sale DECIMAL(7. The table definition is designed for optimal performance because the definition is kept in memory.00 95000. the table must be populated with an INSERT/SELECT statement like the following: .Chapter 4 DEPT 10 20 30 40 Sum_SAL 40000.2) . The ability to use the rows multiple times is the biggest advantage over derived tables. unlike a derived table. They require NO data dictionary access or transaction logging.00 20000.

INSERT INTO Sales_Report_vt
SELECT Sale_Date
, SUM(Daily_Sales)
, AVG(Daily_Sales)
, MAX(Daily_Sales)
, MIN(Daily_Sales)
FROM Sales_Table
GROUP BY Sale_Date;

The CREATE statement of a volatile table has a few options that need further explanation. The LOG option indicates there will be transaction logging of before images. The ON COMMIT PRESERVE ROWS option means that at the end of a transaction, the rows in the volatile table will not be deleted. Users can ask questions of the volatile table until they log off. Then the table and its data go away.

Global Temporary Tables

Global Temporary Tables are similar to volatile tables in that they are local to a user's session, and the information in the table remains for the entire session. However, unlike volatile tables, the definition is stored in the Data Dictionary when the table is created. This allows for future materialization of the same table: once the table is de-materialized, the definition still resides in the Data Dictionary. Because of this, global tables can survive a system restart, and the table definition will not be discarded at the end of the session. If the global table definition needs to be dropped, then an explicit DROP command must be executed. When a session normally terminates, the rows inside the Global Temporary Table will be removed. Global tables require no spool space; instead, they are materialized in a permanent area known as Temporary Space (Temp Space). Users from other sessions cannot access another user's materialized global table. How real does Teradata consider global temporary tables? They can even be referenced from a view or macro. An example of how to create a global temporary table would be as follows:

CREATE GLOBAL TEMPORARY TABLE Sales_Report_gt, LOG
( Sale_Date DATE
, Sum_Sale  DECIMAL(9,2)
, Avg_Sale  DECIMAL(7,2)
, Max_Sale  DECIMAL(7,2)
, Min_Sale  DECIMAL(7,2) )
PRIMARY INDEX(Sale_Date)
ON COMMIT PRESERVE ROWS;

Now that the Global Temporary Table has been created, the table must be populated with an INSERT/SELECT statement like the following:

INSERT INTO Sales_Report_gt
SELECT Sale_Date
, SUM(Daily_Sales)
, AVG(Daily_Sales)
, MAX(Daily_Sales)
, MIN(Daily_Sales)
FROM Sales_Table
GROUP BY Sale_Date;
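For readers without a Teradata system handy, the Sales_Report pattern above can be sketched in Python, with a SQLite TEMP table standing in for a session-local temporary table. The Sales_Table rows are invented, and SQLite's temporary-table semantics are only an approximation of Teradata's volatile and global temporary tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Sales_Table (Sale_Date TEXT, Daily_Sales REAL)")
con.executemany("INSERT INTO Sales_Table VALUES (?, ?)",
                [("2004-01-01", 100.0), ("2004-01-01", 200.0),
                 ("2004-01-02", 50.0)])

# Materialize the per-date summary once, as in the INSERT/SELECT above.
con.execute("""CREATE TEMP TABLE Sales_Report AS
               SELECT Sale_Date,
                      SUM(Daily_Sales) AS Sum_Sale,
                      AVG(Daily_Sales) AS Avg_Sale,
                      MAX(Daily_Sales) AS Max_Sale,
                      MIN(Daily_Sales) AS Min_Sale
               FROM Sales_Table
               GROUP BY Sale_Date""")

# The materialized rows can now be queried repeatedly for the session
# without rebuilding them.
rows = con.execute("SELECT * FROM Sales_Report ORDER BY Sale_Date").fetchall()
print(rows)
# [('2004-01-01', 300.0, 150.0, 200.0, 100.0), ('2004-01-02', 50.0, 50.0, 50.0, 50.0)]
```

As with a volatile table, the summary disappears when the connection (session) closes, while the base Sales_Table remains.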
