
1. Introduction
Strong data warehouse performance is critical to keeping users satisfied, attaining service level
agreements (SLAs) and maximizing the return on investment (ROI) in the Teradata system.
Sometimes queries that perform unnecessary full-table scans or other operations that consume too
many system resources are submitted to the data warehouse. Application tuning is a process to
identify and tune target applications for performance improvements and proactively prevent
application performance problems.
Some data warehouses handle millions of queries in a day. This makes it difficult for DBAs to identify
suspect queries. A suspect query is one that either consumes too many system resources or is not
taking advantage of Teradata’s parallelism. Identifying and documenting the frequency of problem
queries offers a more comprehensive view of the queries affecting data warehouse performance and
helps prioritize tuning efforts.
DBQL is a rich resource for performance data, as it provides full SQL text, CPU and I/O by query,
number of active AMPs in a query, spool use, number of query steps and full explain text. It also
offers information to calculate suspect query indicators such as large-table scans, skewing (when the
Teradata system is not using all the AMPs in parallel) and large-table-to-large-table product joins.
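The suspect-query indicators described above can be sketched in a few lines. This is a minimal illustration, not DBQL's schema: the record layout, field names and thresholds below are hypothetical, and assume that per-query total CPU, busiest-AMP CPU and active AMP count have already been extracted from the log.

```python
from dataclasses import dataclass


@dataclass
class QueryLogRow:
    """Hypothetical per-query summary (field names are illustrative only)."""
    query_id: int
    total_cpu: float    # CPU seconds summed over all AMPs
    max_amp_cpu: float  # CPU seconds on the busiest AMP
    active_amps: int    # number of AMPs that took part in the query


def cpu_skew(row):
    """0.0 means every AMP did equal work; values near 1.0 mean one AMP did most of it."""
    if row.max_amp_cpu <= 0 or row.active_amps <= 0:
        return 0.0
    average_amp_cpu = row.total_cpu / row.active_amps
    return 1.0 - average_amp_cpu / row.max_amp_cpu


def suspect_queries(rows, skew_threshold=0.5, cpu_threshold=100.0):
    """Return the ids of queries that are either expensive or badly skewed.

    Both thresholds are arbitrary example values, not recommendations.
    """
    return [
        row.query_id
        for row in rows
        if row.total_cpu >= cpu_threshold or cpu_skew(row) >= skew_threshold
    ]


log = [
    QueryLogRow(1, total_cpu=40.0, max_amp_cpu=10.0, active_amps=4),    # even, cheap
    QueryLogRow(2, total_cpu=40.0, max_amp_cpu=40.0, active_amps=4),    # one AMP did everything
    QueryLogRow(3, total_cpu=400.0, max_amp_cpu=100.0, active_amps=4),  # even but expensive
]
flagged = suspect_queries(log)
```

Counting how often each flagged query text recurs would then give the frequency view used to prioritize tuning.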

Whitepaper | TERADATA PERFORMANCE TUNING

2. Teradata Architecture

2.1 The Parsing Engine is responsible for:
• Managing individual sessions (up to 120).
• Parsing and optimizing your SQL requests.
• Dispatching the optimized plan to the AMPs.
• Sending the answer set response back to the requesting client.
2.2 The Message Passing Layer is responsible for:
• Carrying messages between the AMPs and PE.
• Point-to-Point and Broadcast communications.
• Merging answer sets back to the PE.
• Making Teradata parallelism possible.
2.3 The AMPs are responsible for:
• Finding the rows requested.
• Lock management.
• Sorting rows.
• Aggregating columns.
• Join processing.
• Output conversion and formatting.
2.4 Storing Rows:
• The rows of every table are distributed among all AMPs.
• Each AMP is responsible for a subset of the rows of each table.
• Ideally, each table will be evenly distributed among all AMPs.
• The uniformity of distribution of the rows of a table depends on the choice of the Primary Index.
• Evenly distributed tables result in evenly distributed workloads.

3. Data Distribution

3.1 Primary Index
• The value of the Primary Index for a specific row determines the AMP assignment for that row.
• This is done using a hashing algorithm.
• Accessing the row by its Primary Index value is always a one-AMP operation and the most efficient way to access a row.
• There are two types: UPI (Unique Primary Index) and NUPI (Non-Unique Primary Index).
3.1.1 Accessing Via a Unique Primary Index
• A UPI access is a one-AMP operation which may access at most a single row.

3.1.2 Row Distribution Using a UPI
• Often, but not always, the PK column(s) will be used as a UPI.
• Teradata will distribute different index values evenly across all AMPs.
• Resulting row distribution among AMPs is very uniform.
• Assures maximum efficiency for parallel operations.
3.1.3 Row Distribution Using a NUPI
• Customer_Number may be the preferred access column for the ORDER table, thus a good index candidate.
• Values for Customer_Number are somewhat non-unique.
• Choice of Customer_Number is therefore a NUPI.
• Rows with the same PI value distribute to the same AMP.
• Row distribution is less uniform or "skewed".
3.1.4 Row Distribution Using a Highly Non-Unique Primary Index (NUPI)
• Values for Order_Status are "highly" non-unique.
• Choice of Order_Status column is a NUPI.
• Only two values exist, so only two AMPs will ever be used for this table.
• Table will not perform well in parallel operations.
• Highly non-unique columns are poor PI choices generally.
• The degree of uniqueness is critical to efficiency.
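The effect of index uniqueness on row distribution can be simulated. The sketch below is only an illustration: it uses MD5 as a stand-in hash (Teradata's own hashing algorithm is not reproduced here), and the AMP count and sample data are made up.

```python
import hashlib
from collections import Counter


def amp_for(pi_value, num_amps):
    """Map a primary-index value to an AMP number using a stand-in hash (not Teradata's)."""
    digest = hashlib.md5(str(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_amps


def rows_per_amp(pi_values, num_amps):
    """Count how many rows land on each AMP for a given list of PI values."""
    counts = Counter(amp_for(value, num_amps) for value in pi_values)
    return [counts.get(amp, 0) for amp in range(num_amps)]


def amps_used(pi_values, num_amps):
    """Number of AMPs that hold at least one row."""
    return sum(1 for count in rows_per_amp(pi_values, num_amps) if count > 0)


NUM_AMPS = 8  # arbitrary system size for the illustration
order_numbers = list(range(10_000))                             # unique, like a UPI
customer_numbers = [n % 500 for n in order_numbers]             # 20 rows per value, like a NUPI
order_status = ["O" if n % 2 else "C" for n in order_numbers]   # only two distinct values

for label, values in (
    ("Order_Number", order_numbers),
    ("Customer_Number", customer_numbers),
    ("Order_Status", order_status),
):
    print(label, rows_per_amp(values, NUM_AMPS))
```

Because every row with the same PI value hashes to the same AMP, the Order_Status column can never occupy more than two AMPs, however many AMPs the system has.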

3.2 Partitioned Primary Index
• A Teradata feature that allows a query to access only a portion of the data of a large table.
• Rows are still hashed to the virtual AMPs by their PI value.
• PPI does not alter data distribution; it only creates partitions on data already distributed based on the PI.
• PPIs are defined on a table in order to increase query efficiency by avoiding full table scans.
• Partitions are usually defined based on Range or Case.
• If we specify NO RANGE or NO CASE, then all the values that fall outside the listed ranges or cases will be placed in a single partition.
• If we specify UNKNOWN, then all NULL values will be placed in that partition.

Partition by Case
    -- ORDER is a reserved word, so the example table is named Orders.
    CREATE TABLE Orders
    (
        Ord_number      INTEGER NOT NULL,
        Customer_number INTEGER NOT NULL,
        Order_date      DATE,
        Order_total     INTEGER
    )
    PRIMARY INDEX (Customer_number)
    PARTITION BY CASE_N (
        Order_total < 10000,
        Order_total < 20000,
        Order_total < 30000,
        NO CASE OR UNKNOWN
    );

Partition by Range
    CREATE TABLE Orders
    (
        Ord_number      INTEGER NOT NULL,
        Customer_number INTEGER NOT NULL,
        Order_date      DATE,
        Order_total     INTEGER
    )
    PRIMARY INDEX (Customer_number)
    PARTITION BY RANGE_N (
        Order_date BETWEEN DATE '2013-01-01' AND DATE '2013-12-01'
                   EACH INTERVAL '1' MONTH,
        NO RANGE OR UNKNOWN
    );
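To make the Partition by Range example concrete, the sketch below mimics how a row's Order_date could map to a partition number, and which partitions a date filter has to touch. It is an illustration only: the function names are hypothetical, and the numbering (monthly ranges first, then one shared partition for out-of-range and NULL values) is an assumption about the layout, not output from a Teradata system.

```python
from datetime import date, timedelta

RANGE_START = date(2013, 1, 1)
RANGE_END = date(2013, 12, 1)   # same end date as the example table definition


def month_index(day):
    """Whole months between RANGE_START and the month containing `day`."""
    return (day.year - RANGE_START.year) * 12 + (day.month - RANGE_START.month)


NUM_RANGES = month_index(RANGE_END) + 1   # one range per month started
CATCH_ALL = NUM_RANGES + 1                # assumed shared NO RANGE OR UNKNOWN partition


def partition_for(order_date):
    """Return the assumed partition number for one Order_date value."""
    if order_date is None:                # NULL date: the UNKNOWN case
        return CATCH_ALL
    if order_date < RANGE_START or order_date > RANGE_END:
        return CATCH_ALL                  # outside every range: the NO RANGE case
    return month_index(order_date) + 1


def partitions_to_scan(first_day, last_day):
    """Partitions a filter 'Order_date BETWEEN first_day AND last_day' would touch."""
    touched = set()
    day = first_day
    while day <= last_day:
        touched.add(partition_for(day))
        day += timedelta(days=1)
    return sorted(touched)


february_to_march = partitions_to_scan(date(2013, 2, 10), date(2013, 3, 5))
```

A query restricted to a few weeks touches two of the monthly partitions instead of the whole table, which is the point of partitioning. Note that with an end date of 2013-12-01 the last range covers a single day, so later December dates fall into the catch-all partition.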

4. Performance Tuning Thumb Rules
These are some best practices to follow to get the best performance out of Teradata. To determine the best tuning options, it is important to baseline existing performance conditions, pilot potential solutions through experimentation and analyze the results.
4.1 Run explain plan
• Check for "no confidence" or "low confidence".
• Check for product join conditions.
• Check for "by way of an all-rows scan" (FTS).
• Check for Translate.
• Check for DISTINCT / GROUP BY.
• Check for IN / NOT IN keywords.
4.1.1 In case of product join scenarios check for
• Proper usage of aliases.
• Joining on matching columns.
• Usage of join keywords, such as specifying the type of join (e.g. inner or outer).
• Use UNION in case of "OR" scenarios.
• Ensure statistics are collected on join columns; this is especially important if the columns you are joining on are not unique.
4.2 Collect Stats
• Run the command "DIAGNOSTIC HELPSTATS ON FOR SESSION".
• Gather information on the columns on which stats have to be collected.
• Collect stats on the suggested columns.
• Also check for stats missing on PI, SI or columns used in joins: "HELP STATS <databasename>.<tablename>".

• Make sure stats are re-collected when at least 10% of the data changes.
• Remove unwanted stats, or stats which hardly improve the performance of the queries.
• Collect stats on columns instead of indexes, since dropping an index drops its stats as well.
• Collect stats on indexes having multiple columns; this might be helpful when these columns are used in join conditions.
• Check that stats are re-created for tables whose structure has changed.
[Figure: Example 1, Explain before collecting stats]
[Figure: Example 2, Explain after collecting stats]
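The 10% rule of thumb above is simple arithmetic. A minimal sketch, assuming the row count at collection time and the number of rows changed since then are tracked somewhere (both inputs and the function name are hypothetical):

```python
def stats_are_stale(rows_when_collected, rows_changed, threshold=0.10):
    """Return True once the changed share of the table reaches the threshold."""
    if rows_when_collected <= 0:
        # The table was empty when stats were taken, so any change makes them stale.
        return rows_changed > 0
    return rows_changed / rows_when_collected >= threshold
```

For a table of 1,000,000 rows, stats would be refreshed after 100,000 inserted, updated or deleted rows.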

• Information that statistics collect:
1. The number of rows in the table
2. The average row size
3. Information on all indexes on which statistics were collected
4. The range of values for the column(s) on which statistics were collected
5. The number of rows per value for the column(s) on which statistics were collected
6. The number of NULLs for the column(s) on which statistics were collected
• Columns on which stats need to be collected:
1. All Non-Unique Primary Indexes and all Non-Unique Secondary Indexes
2. Non-indexed columns used in joins
3. The Unique Primary Index of small tables (less than 1,000 rows per AMP)
4. Primary Index of a Join Index
5. Secondary Indexes defined on any join index
6. Join index columns, and any additional join index columns, that frequently appear in WHERE search conditions
7. Columns that frequently appear in WHERE search conditions or in the WHERE clause of joins
Examples:
    COLLECT STATISTICS ON Emp_Table;
    COLLECT STATISTICS ON Emp_Table COLUMN Dept_no;
    COLLECT STATISTICS ON Emp_Table COLUMN (Emp_no, Dept_no);
    COLLECT STATISTICS ON Emp_Table INDEX Emp_no;
    COLLECT STATISTICS ON Emp_Table INDEX (First_name, Last_name);

• Table-level statistics known as "summary statistics" are collected whenever column or index statistics are collected.
    SHOW SUMMARY STATISTICS VALUES ON Employee_Table;
4.3 Avoid Full table scan scenarios
• Try to avoid FTS scenarios, as it might take a very long time to access all the data on every AMP in the system.
• Make sure an SI is defined on the columns which are used as part of joins or as an alternate access path.
• Collect stats on SI columns; otherwise the optimizer might go for an FTS even when an SI is defined on that particular column.
• If intermediate tables are used to store results, make sure they have the same PI as the source and destination tables.
• For a large list of values, avoid using IN / NOT IN in SQL. Write the list values to a temporary table and use the temporary table for computations.
• Be careful when choosing EXISTS / NOT EXISTS conditions, since they ignore unknown comparisons (e.g. a NULL value in the column results in unknown), which can lead to inconsistent results.
• Some examples of when a Full Table Scan is performed:
- The SQL statement does not contain a WHERE clause.
- The WHERE clause does not use the Primary or a Secondary Index.
- The SQL statement uses a partial value (LIKE, ...) in the WHERE clause.
- The SQL statement uses inequality operators (<, >, ...) in the WHERE clause.
[Figure: Example 1, Explain without condition]

[Figure: Example 2, Explain with condition]
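The difference between the two Explain examples can be reproduced in miniature. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in: its plan text and optimizer are not Teradata's, and the table, index and data are made up. It only illustrates the general point that a query with no usable index condition scans the whole table.

```python
import sqlite3


def plan(conn, sql):
    """Return SQLite's query-plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail text


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_no INTEGER, last_name TEXT, dept_no INTEGER)")
conn.execute("CREATE INDEX employee_emp_no_idx ON employee (emp_no)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(i, f"name{i}", i % 10) for i in range(1000)],
)

# No WHERE clause: every row must be read.
no_where = plan(conn, "SELECT * FROM employee")
# WHERE on an indexed column: the index is searched instead.
indexed = plan(conn, "SELECT * FROM employee WHERE emp_no = 42")
# WHERE on a column without an index: back to a full scan.
unindexed = plan(conn, "SELECT * FROM employee WHERE dept_no = 3")
# Partial value with a leading wildcard: also a full scan.
partial = plan(conn, "SELECT * FROM employee WHERE last_name LIKE '%ame4%'")

for label, text in (("no WHERE", no_where), ("indexed", indexed),
                    ("unindexed", unindexed), ("partial", partial)):
    print(f"{label:10s} {text}")
```

On Teradata the corresponding signal is the phrase "by way of an all-rows scan" in the EXPLAIN output.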

5. Teradata Performance Tuning Tips

Tip 1: What are the criteria for choosing the best Primary Index?
• Be careful while choosing the primary index because it affects data storage and performance.
• The following are the important points while choosing the primary index:
1. Data distribution. You need to analyze the number of distinct values in the table. If the primary index of the table contains fewer NULL values and more distinct values, it will give better performance.
2. Access frequency. The column has to be frequently used in the WHERE clause during row selection. The column should be one that is frequently used in joins.
3. Volatility. The column should not be frequently changed.
Tip 2: Joined columns must be of the same data type
• When joining columns from two tables, the optimizer makes sure that the data types are the same.
• Otherwise the optimizer will translate the column in the driving table to match that of the derived table.
Example:
    TABLE employee  deptno (char)
    TABLE dept      deptno (integer)
• Make sure you are joining columns that have the same data types to avoid translation.
Tip 3: Do not use functions like SUBSTR, COALESCE, CASE ... on the indices used as part of a join
• Avoid using functions such as SUBSTR, COALESCE and CASE on the indices used in a join.
• The optimizer cannot use the stats collected on a column once a function is applied to it.
• This might result in a product join or spool-out issues, because the optimizer cannot make good decisions when no stats are available on the column.
Tip 4: Not Null columns
• Use a NOT NULL condition on columns which are declared NULLABLE in the table definition. NULL values all hash to one poor AMP, which can result in the infamous "NO SPOOL SPACE" error once that AMP cannot accommodate any more NULL values.
• It is recommended to use a NOT NULL condition while joining on the nullable columns of a table so that skew can be avoided.
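The data-distribution check from Tip 1 and the NULL concentration described in Tip 4 can be measured on a sample of candidate column values before committing to a primary index. The helper below is a hypothetical sketch; its name and the shape of its result are not part of any Teradata tool.

```python
from collections import Counter


def pi_profile(values):
    """Summarize how suitable a column's values look for a primary index.

    `values` is a list of column values where None stands for NULL.
    """
    nulls = sum(1 for value in values if value is None)
    counts = Counter(value for value in values if value is not None)
    largest_value_group = max(counts.values(), default=0)
    return {
        "rows": len(values),
        "distinct": len(counts),
        "nulls": nulls,
        # All NULLs hash to the same AMP, so they behave like one large value group.
        "largest_group": max(largest_value_group, nulls),
    }


customer_profile = pi_profile([101, 102, 103, 101, 104, 105])
status_profile = pi_profile(["O", "C", None, None, None, "O"])
```

A good candidate has a distinct count close to the row count and a small largest group; a column dominated by NULLs or by a handful of values will concentrate rows on a few AMPs.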

Tip 5: Usage of Like clause
• If LIKE is used in a WHERE clause, it is better to use one or more leading characters in the pattern, if at all possible.
Example: LIKE '%SUBIN%' will be processed differently from LIKE 'SUBIN%'.
• In the former, the optimizer needs to do a full table scan, which reduces performance.
• In the latter, the optimizer can make use of the index to perform the query, thereby increasing performance.
Hence it is suggested to go for '%SUBIN%' only if SUBIN can appear inside a longer value rather than at its beginning.
Tip 6: Distinct Vs Group by
• Both return the same number of rows, but with some difference in execution time.
• GROUP BY sorts the data locally on the vprocessor, while DISTINCT redistributes the data and then sorts it.
• DISTINCT redistributes the rows immediately, so more data may move between the AMPs, whereas GROUP BY only sends unique values between the AMPs.
• When data is nearly unique in a table, GROUP BY will spend more time attempting to eliminate duplicates that do not exist at all.
Steps used in each case for elimination of duplicates:
GROUP BY
• It reads all the rows that are part of the GROUP BY.
• It removes all duplicates on each AMP for the given set of values using the "BUCKETS" concept.
• It hashes the unique values on each AMP.
• Then it redistributes them to the appropriate AMPs.
• Once redistribution is completed, it
a. Sorts data to group duplicates on each AMP
b. Removes all the duplicates on each AMP and sends the original/unique value
DISTINCT
• It reads each row on the AMP.
• It hashes the column value identified in the DISTINCT clause of the SELECT statement.
• Then it redistributes the rows according to row value to the appropriate AMP.
• Once redistribution is completed, it sorts data to group duplicates on each AMP.
• It removes all the duplicates on each AMP and sends the original/unique value.
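The two step lists for Tip 6 differ mainly in how many rows cross between AMPs. The sketch below counts that movement for a toy system; it models only the steps as described in this tip (each inner list is the rows held by one AMP) and is not a model of the actual Teradata executor.

```python
def rows_moved_distinct(amps):
    """DISTINCT as described: every row is hashed and redistributed."""
    return sum(len(rows) for rows in amps)


def rows_moved_group_by(amps):
    """GROUP BY as described: duplicates are removed on each AMP before redistribution."""
    return sum(len(set(rows)) for rows in amps)


def final_result(amps):
    """Both approaches end with the same set of unique values."""
    return sorted(set().union(*amps))


many_duplicates = [["A", "A", "B", "A"], ["B", "B", "A", "B"]]
nearly_unique = [[1, 2, 3, 4], [5, 6, 7, 8]]

for label, amps in (("many duplicates", many_duplicates), ("nearly unique", nearly_unique)):
    print(label,
          "DISTINCT moves", rows_moved_distinct(amps),
          "GROUP BY moves", rows_moved_group_by(amps))
```

With many duplicates GROUP BY ships half as many rows in this toy case. With nearly unique data both ship the same number, so the local duplicate elimination done by GROUP BY is wasted effort.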

Hence it is better to go for:
GROUP BY: when there are many duplicates
DISTINCT: when there are few or no duplicates
Tip 7: Which is faster: select * from table or select 'all Columns' from table?
• In case of using "select * from table", an extra stage is added where * is replaced by the column names by Teradata before it fetches the data.
• Using "select <all Columns> from table" eliminates this extra stage of verifying and fetching the columns of the table.
• Hence it is always recommended to use "select <all Columns> from table".
Tip 8: Difference between delete and delete all?
• Both return the same result.
• Delete will truncate the data but maintain the index table.
• Delete all will truncate the data as well as the index table.
Tip 9: Variable length columns
• The use of variable length columns should be minimized.
• Fixed length columns should be preferred when defining tables.
• VARCHARs are bad for read performance because each record can be of variable length, which makes it more costly to find fields in a record.
• Tables with fixed-length rows are easier to reconstruct if you have a table crash.
• If you are choosing between CHAR and VARCHAR columns, the tradeoff is one of time versus space. If speed is your primary concern, use CHAR columns to get the performance benefits of fixed-length columns. If space is at a premium, use VARCHAR columns.
Tip 10: Union Vs Union All
• The "union" command can be used to break up a large SQL process or statement into several smaller SQL processes or statements, which would run in parallel.
• But these could then cause spool space limit problems.
• "Union all" executes the SQL statements single threaded.
• A UNION query, by definition, eliminates all duplicate rows (as opposed to UNION ALL) and is slower.
• You should avoid unnecessary UNIONs; they are a huge performance leak. As a rule of thumb, use UNION ALL if you are not sure which to use.
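The row-level difference behind Tip 10 can be seen with any SQL engine. The sketch below uses SQLite through Python's sqlite3 module as a stand-in (not Teradata), with made-up tables. It shows only the duplicate-elimination semantics, not the performance difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_2012 (customer_number INTEGER)")
conn.execute("CREATE TABLE orders_2013 (customer_number INTEGER)")
conn.executemany("INSERT INTO orders_2012 VALUES (?)", [(1,), (2,), (2,)])
conn.executemany("INSERT INTO orders_2013 VALUES (?)", [(2,), (3,)])

# UNION has to compare rows from both inputs to remove duplicates.
union_rows = conn.execute(
    "SELECT customer_number FROM orders_2012 "
    "UNION "
    "SELECT customer_number FROM orders_2013 "
    "ORDER BY 1"
).fetchall()

# UNION ALL simply appends the second result to the first.
union_all_rows = conn.execute(
    "SELECT customer_number FROM orders_2012 "
    "UNION ALL "
    "SELECT customer_number FROM orders_2013 "
    "ORDER BY 1"
).fetchall()
```

UNION returns each customer once, while UNION ALL returns all five input rows. The duplicate elimination is the extra work that makes UNION slower, so UNION ALL is the cheaper choice whenever duplicates are impossible or acceptable.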

Tip 11: Strategic Semicolon
• At the end of every SQL statement, there is a semicolon.
• In some cases, the strategic placement of this semicolon can improve the SQL time of a group of SQL statements.
• But this will not improve an individual SQL statement's time.
Example:
1) The group's SQL time could be improved if a group of SQL statements share the same tables (or spool files).
2) The group's SQL time could be improved if several SQL statements use the same unix input file.
Tip 12: Unix split OR Unix concatenation
Split
• A large input unix file could be split into several smaller unix files, which could then be input in series, or in parallel, to create smaller SQL processing steps.
Concatenation
• A large query could be broken up into smaller independent queries, whose output is written to several smaller unix files.
• Then these smaller files are unix concatenated together to provide a single unix file.
Tip 13: IN Vs EXISTS in Teradata SQL
• Performance-wise both should be the same with a small number of records.
• If the number of records is large, EXISTS is faster than IN.
• Mostly IN is used in case of subqueries and EXISTS is used in case of correlated subqueries.
Tip 14: NOT IN Vs NOT EXISTS
• There is a huge difference between NOT IN and NOT EXISTS.
• NOT EXISTS simply ignores NULLs.
• Never use NOT IN on NULLable columns, because this might return unexpected result sets. (The result set might be empty, and even if it's correct, there's a lot of work for the database.)
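The NULL behaviour warned about in Tip 14 follows from standard SQL three-valued logic, so it can be demonstrated on any engine. The sketch below uses SQLite through Python's sqlite3 module as a stand-in (not Teradata), with made-up tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_number INTEGER)")
conn.execute("CREATE TABLE orders (customer_number INTEGER)")  # nullable column
conn.executemany("INSERT INTO customer VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (None,)])

# Customers without orders, written with NOT IN.
# The NULL in the subquery makes every comparison unknown, so no row qualifies.
not_in_rows = conn.execute(
    "SELECT customer_number FROM customer "
    "WHERE customer_number NOT IN (SELECT customer_number FROM orders) "
    "ORDER BY 1"
).fetchall()

# The same question written with NOT EXISTS.
# The NULL row never matches the correlation predicate, so it is simply ignored.
not_exists_rows = conn.execute(
    "SELECT c.customer_number FROM customer c "
    "WHERE NOT EXISTS ("
    "    SELECT 1 FROM orders o WHERE o.customer_number = c.customer_number"
    ") "
    "ORDER BY 1"
).fetchall()
```

The NOT IN query returns no rows at all, while NOT EXISTS returns customers 2 and 3. Adding "WHERE customer_number IS NOT NULL" inside the NOT IN subquery would make the two agree.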

Tip 15: Top Vs SAMPLE
• TOP 10 means "first 10 rows in sorted order".
• The optimizer is free to select the cheapest plan it can find and stop processing as soon as it has found enough rows to return.
• SAMPLE does extra processing to try to randomize the result; for example, it could pick a random point at which to start scanning the table and a number of rows to skip between rows that are returned.
• TOP really comes into good use when you are dealing with larger tables and queries: rather than running the entire query and then returning "sample" records, as the query runs it (at a very simple level) picks the first (or "top") 10 records which have been returned from any node, and then stops the query.
Tip 16: Global temporary table vs volatile table
• Whenever we create a GTT, its definition is stored in the Data Dictionary.
• Whenever we insert data into a GTT, the data is stored in temp space. So the definition of the table remains active until it is removed with a DROP TABLE statement, while the data remains active only up to the end of the session.
• Whenever we create a VTT, its definition is stored in the system cache.
• Whenever we insert data into a VTT, the data is stored in spool space. So both the table definition and the data remain active only up to the end of the session.
• We can collect statistics on global temporary tables.
• We cannot collect statistics on volatile tables in older releases. (Teradata 13 and above will allow you to collect stats on volatile tables.)
Tip 17: Use Same PI in Source & Target
• If the source and target have the same PI, the data dump can happen very efficiently from source to target.
Tip 18: DROPPING volatile tables explicitly
• Once volatile tables are no longer required, you can drop them. Don't wait for the complete procedure to be over.
• This will free some spool space immediately and could prove to be very helpful in avoiding the No More Spool Space error.
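The contrast drawn in Tip 15 between stopping early and reading everything can be sketched with a counting iterator. This is a toy model under stated assumptions: the class and function names are hypothetical, and the "sample" here simply reads the whole input before choosing rows, which is a simplification of what the tip describes rather than Teradata's SAMPLE implementation.

```python
import itertools
import random


class CountingScan:
    """Iterates over rows while counting how many were actually read."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_read = 0

    def __iter__(self):
        for row in self.rows:
            self.rows_read += 1
            yield row


def top_n(scan, n):
    """Take the first n rows and stop; the rest of the table is never read."""
    return list(itertools.islice(scan, n))


def sample_n(scan, n, seed=0):
    """Simplified sampling: read everything, then pick n rows at random."""
    everything = list(scan)
    return random.Random(seed).sample(everything, n)


table = list(range(1000))

top_scan = CountingScan(table)
top_rows = top_n(top_scan, 10)

sample_scan = CountingScan(table)
sample_rows = sample_n(sample_scan, 10)
```

In this model the TOP-style read touches 10 rows and the sampling read touches all 1,000, which is why TOP is the cheaper way to just peek at a large table.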

Tip 19: NO LOG for volatile tables
• Create volatile tables with the NO LOG option.
Tip 20: UPDATE clause and replacing UPDATE with DELETE & INSERT
• Do not write an UPDATE with just a SET clause and no WHERE condition.
• Even if the target/source has just one row, add a WHERE clause on the PI column.
• Sometimes replacing UPDATE with DELETE & INSERT can save a good amount of AMP CPU. Check if this holds good for your query.
Tip 21: Unnecessary casting for DATE columns
• Avoid unnecessary casting for DATE columns.
• Once defined as DATE, you can compare date columns against each other even when they are in different formats.
• Internally, DATE is stored as INTEGER. CAST is required mainly when you have to compare a VARCHAR value as a DATE.
Tip 22: Avoid UDF
• Most of the functions needed for data manipulation are available in Teradata. So avoid User Defined Functions unless there is no other way.
Tip 23: Use COMPRESS
• Use COMPRESS on whichever attributes possible in the table creation statement, especially for attributes having lots of NULL values or a few known values.
• This helps in reducing I/O and hence improves performance.
Tip 24: IN or BETWEEN
• When there is a choice between the IN and the BETWEEN clauses in a query, it is advantageous to use the BETWEEN clause, as it is much more efficient.
Example:
    SELECT customer_number, customer_name
    FROM customer
    WHERE customer_number IN (1000, 1001, 1002, 1003, 1004);
is much less efficient than:
    SELECT customer_number, customer_name
    FROM customer
    WHERE customer_number BETWEEN 1000 AND 1004;
Assuming there is a useful index on customer_number, the Query Optimizer can locate a range of numbers much faster (using BETWEEN) than it can find a series of numbers using the IN clause.
Tip 25: MultiLoad delete or Delete command
• MultiLoad delete is faster than the normal Delete command, since the deletion happens in data blocks of 64 KB, whereas the Delete command deletes data row by row.
• The transient journal maintains entries only for the Delete command, since Teradata utilities do not support transient journal logging.
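The reasoning behind Tip 24 can be sketched with a sorted list standing in for a value-ordered index on customer_number. This is an assumption made for illustration (unique keys, a simple sorted structure); it is not a description of how Teradata stores or probes its indexes.

```python
from bisect import bisect_left, bisect_right


def lookup_between(sorted_keys, low, high):
    """Two binary searches locate the ends, then one contiguous slice is read."""
    start = bisect_left(sorted_keys, low)
    stop = bisect_right(sorted_keys, high)
    return sorted_keys[start:stop], 2          # (matches, number of index probes)


def lookup_in(sorted_keys, wanted):
    """One binary search per listed value (keys are assumed unique)."""
    matches = []
    for value in wanted:
        position = bisect_left(sorted_keys, value)
        if position < len(sorted_keys) and sorted_keys[position] == value:
            matches.append(value)
    return matches, len(wanted)                # (matches, number of index probes)


customer_numbers = list(range(900, 1100))      # stand-in for the index, already sorted

between_rows, between_probes = lookup_between(customer_numbers, 1000, 1004)
in_rows, in_probes = lookup_in(customer_numbers, [1000, 1001, 1002, 1003, 1004])
```

Both lookups return the same five customers, but the range form needs a fixed two probes while the list form needs one probe per value, and the gap widens as the list grows.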

6. Glossary

Acronym   Expansion
SLA       Service Level Agreement
ROI       Return On Investment
DBQL      Database Query Log
AMP       Access Module Processor
PE        Parsing Engine
UPI       Unique Primary Index
NUPI      Non-Unique Primary Index
PPI       Partitioned Primary Index
FTS       Full Table Scan
PI        Primary Index
SI        Secondary Index