Easy Statistics Recommendations  Statistics Wizard Feature
The principle focus of query tuning is to provide reliable summary information about the data to the optimizer. This is done by collecting accurate statistics which are then stored in a synoptic data structure known as an interval histogram. The correct choice of the column and index sets on which Statistics should be collected can help the optimizer generate better query plans, dramatically improving query performance and reduce the collection overhead. It can be difficult to understand how the optimizer uses statistics as well as deciding what statistics are needed without an automated method to help. That automated method is the Teradata Statistics Wizard, which is a clientbased GUI interface for obtaining statistics recommendations for particular queries or query workloads that are submitted for analysis.
Easy Feature of the Statistics Wizard
Statistical information availability is critical for optimal query plans, but Statistics collection is a time and resource consuming job. Appropriate and uptodate Statistics can improve the performance of a query. As the Teradata Database becomes more powerful, applications and the queries they perform have also become more complex and sophisticated. The job to decide upon adequate and appropriate Statistics collection, recollection of Statistics, and the correct balance between good query plans and the time required to collect Statistics on complex databases has become very complicated.
The Easy Feature of the Statistics Wizard simplifies this by automating the process of collection/recollection of Statistics for particular database/tables by applying the best practices based upon the data demographics information available in the database system.
Approaches to determine the recommendations:
There are two main approaches in using the Easy Feature for Statistics recommendations:
1) Full Statistics Recommendations are based on the following attributes:
 Age of last collection in days
 Table growth
 Table skew
 General statistics rules
 Search for new tables/ indexes
2) Sampled Statistics Recommendations are based on the following attributes:
 Tables with greater than 1000 rows/AMP
 Unique index columns
 Highly unique indexes
Purposes of Statistics:
The following are the most important reasons for column and index statistics.
 The optimizer uses statistics to decide whether it should generate a query plan that uses a secondary, hash, or join index instead of performing a fulltable scan.
 The optimizer uses statistics to estimate the cardinalities of intermediate spool files based on the qualifying conditions specified by a query.
 The estimated cardinality of intermediate results is critical for the determination of both optimal join orders for tables and the kind of join method that should be used to make those joins.
 For PPI tables, statistics collected on the PARTITION systemderived column permit the optimizer to better estimate costs. The system also uses PARTITION Statistics for estimates when predicates are based on the PARTITION column.
Sampled Statistics:
In many cases, sampled statistics are not accurate enough for the optimizer to generate an optimal or even a good join plan. However, it is sometimes true that statistics collected by sampling small subpopulations of table rows can be as good as those collected with a fulltable scan. The value of collecting fulltable statistics is that they provide the optimizer with the most accurate information that can be gathered for making the best possible query plan cost estimates.
Statistical accuracy is fundamentally important for any query plan because the effect of suboptimal access and join plans generated from inaccurate statistics, of which there can be many in the optimization of a complex query, is significant.
Statistics and optimizer estimates also play a critical role in Teradata Active System Management (TASM) configuration and strategy.
Comparison of Full Statistics Vs. Sample Statistics:
Statistics Method  Characteristics  Best Use 
Full Statistics 


Sampled Statistics 


Full Statistics Recommendations:
If the ‟Full Statistics Recommendations‟ radio button is selected, the Statistics Wizard gives Full Statistics Recommendations based on the following attributes for the selected objects:
 Age of Last Collection in days
 Table Growth in %
 Table Skew
All NUPIs
All NUSIs
 General Statistics Rules
All Indexes for Join Index Table
All NUSIs
All VOSI
All Partitioned Tables
Search for New Tables/ Indexes
Primary Index/Key
Foreign Key
Example: lineitem table definition
CREATE SET TABLE lineitem, NO FALLBACK, NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT ( L_ORDERKEY INTEGER NOT NULL, L_PARTKEY INTEGER NOT NULL, L_QUANTITY FLOAT, L_SHIPMODE CHAR(10)CHARACTER SET LATIN CASESPECIFIC NOT NULL DEFAULT 'IIA4 ', L_SHIPINSTRUCT VARCHAR(45) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL DEFAULT 'IIA5', L_SUPPKEY INTEGER, L_LINENUMBER INTEGER, L_EXTENDEDPRICE FLOAT, L_LINESTATUS CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL DEFAULT 'D') UNIQUE PRIMARY INDEX (L_ORDERKEY) INDEX (L_QUANTITY) INDEX ALL (L_PARTKEY, L_QUANTITY, L_SHIPMODE) ORDER BY VALUES (L_PARTKEY) INDEX (L_SHIPINSTRUCT, L_SUPPKEY, L_LINENUMBER, L_EXTENDEDPRICE, L_LINESTATUS) ORDER BY VALUES (L_SUPPKEY) INDEX (L_PARTKEY, L_QUANTITY, L_LINENUMBER) ORDER BY VALUES (L_LINENUMBER) INDEX (L_LINENUMBER, L_EXTENDEDPRICE, L_LINESTATUS) UNIQUE INDEX (L_PARTKEY);
Age of Last Collection in days:
Recollection of statistics for all columns is recommended if the days elapsed since the last collection exceeds a userspecified threshold.
Example: The last collection on the lineitem table exceeds the userspecified threshold (10 days from date of 05/19/2008).
HELP STATISTICS lineitem; *** Help information returned. 3 rows. *** Total elapsed time was 1 second. Date Time Unique Values Column Names     03/16/2008 16:26:44 1,280 L_PARTKEY , L_QUANTITY, L_SHIPMODE 03/16/2008 16:26:44 1,280 L_PARTKEY , L_QUANTITY, L_LINENUMBER 05/19/2008 10:37:15 1,280 L_SHIPINSTRUCT,L_SUPPKEY,L_LINENUMBER, L_EXTENDEDPRICE , L_LINESTATUS
Statistics were previously collected on (L_PARTKEY, L_QUANTITY, L_SHIPMODE) and (L_PARTKEY , L_QUANTITY, L_LINENUMBER) columns of lineitem table on 03/16/2008. The user selected 10 days as the recollection option and the last collected statistics date exceeded 10 days (from 05/19/2008). Teradata Statistics Wizard will recommend recollection of statistics on columns referred to in the first two rows. No recommendations are provided for last row because statistics were collected on same day (05/19/2008).
Statistics recollection recommendations are based on the previous collected Statistics method. (If originally Sample Statistics were collected on tables then Teradata Statistics Wizard will recommend recollection of Sample Statistics).
Collect Statistics on lineitem Index (L_PARTKEY, L_QUANTITY, L_SHIPMODE); Collect Statistics on lineitem Column (L_PARTKEY, L_QUANTITY, L_LINENUMBER);
Table Growth:
Table growth recommends recollection for all columns that have a change in row count exceeding a userspecified threshold since statistics were last collected. If the total rows count value from the existing statistics is less or greater than the userspecified threshold value then recollection statistics recommendations are generated.
Example: 128 rows have been inserted to table lineitem since the last statistics colletion. (last row count 1280 + new inserted rows 128 = 1408) (refer to Help Statistics output in previous section).
Select count (*) from lineitem; *** Query completed. One row found. One column returned. *** Total elapsed time was 1 second. Count (*)  1408
Teradata Statistics Wizard provides the following statistics recollection recommendations. The row count of lineitem table has exceeded the userspecified threshold (10%).
Table growth recommendations (if full statistics were collected previously):
Collect Statistics on lineitem index (L_PARTKEY, L_QUANTITY, L_SHIPMODE); Collect Statistics on lineitem index (L_SHIPINSTRUCT,L_SUPPKEY,L_LINENUMBER, L_EXTENDEDPRICE, L_LINESTATUS); Collect Statistics on lineitem index (L_PARTKEY, L_QUANTITY, L_LINENUMBER);
Statistics on Skewed Data:
When you collect statistics for skewed data, the optimizer can accommodate exceptional values.
Statistics reveal values that include the most frequent value in the table and the most frequent value per value range. The system divides the table into 100 groupings and maintains statistics on the most nonunique value/range.
Without collected statistics, the system derives row counts from a random AMP sample for:
 Small tables (less than 1000 rows per AMP).
 Unevenly distributed (skewed row distribution due to PI) tables.
Small tables often distribute unevenly. If rows in a table are not distributed evenly across all AMPs, random AMP samples may not represent the true total number of rows in the table.
Table Skew – All NonUnique Primary Indexes:
Full Statistics availability is needed for highly skewed data for the best plan generation. The optimizer needs to be fully aware of such skew. Teradata Statistics Wizard recommends statistics collection on all nonunique primary indexes for tables that have table skew that exceed a userconfigured threshold.
SQL to calculate table skew on all NUPI:
SELECT di.databasename, di.tablename, di.indexname, di.indextype, di.uniqueflag, SUM(cast(t.currentperm as decimal(18,0))/(1024*1024)) perm_mb, 100*((cast(max(t.currentperm ) as decimal(18,3))  cast(ave(t.currentperm ) as decimal(18,3)))) /(nullif(cast(max(t.currentperm) as decimal(18,3)),0)) as tableskew FROM dbc.indices di, dbc.tablesize t WHERE di.databasename in ('<DBName') AND di.tablename in ('<TableName>') AND di.indextype = 'P' AND di.uniqueflag = 'N' AND di.databasename = t.databasename AND di.tablename = t.tablename GROUP BY di.databasename, di.tablename, di.indextype, di.indexname, di.uniqueflag ORDER BY di.databasename, di.tablename, di.indextype, di.indexname;
Table Skew – All NonUnique Primary Index recommendations:
Collect Statistics on NUPITABLE index (NUPI_Col1);
Table Skew – All NonUnique Secondary Indexes:
Teradata Statistics Wizard recommends statistics collection on all nonunique secondary indexes for tables that have index columns skews that exceed a userconfigured threshold.
SQL to calculate table skew on all NUSI:
SELECT di.databasename, di.tablename, di.indexname, di.indextype, di.uniqueflag, SUM(cast(t.currentperm as decimal(18,0))/(1024*1024)) perm_mb, 100*((cast(max(t.currentperm ) as decimal(18,3))  cast(ave(t.currentperm ) as decimal(18,3)))) /(nullif(cast(max(t.currentperm) as decimal(18,3)),0)) as tableskew FROM dbc.indices di, dbc.tablesize t WHERE di.databasename in ('<DBName') AND di.tablename in ('<TableName>') AND di.indextype = 'S' AND di.uniqueflag = 'N' AND di.databasename = t.databasename AND di.tablename = t.tablename GROUP BY di.databasename, di.tablename, di.indextype, di.indexname, di.uniqueflag ORDER BY di.databasename, di.tablename, di.indextype, di.indexname;
Table Skew – All NonUnique Secondary Index recommendations:
Collect Statistics on lineitem index (L_LINENUMBER, L_EXTENDEDPRICE ,L_LINESTATUS); Collect Statistics on lineitem index (L_QUANTITY);
General Statistics Rules  All Indexes for Join Index:
The COLLECT STATISTICS statement collects demographics, computes Statistics from them, and writes the resulting data into individual entries for each individual base table and join index table on the system.
When join indexes are used for estimates, you might not need to collect statistics on the base table columns that are covered by the join index with some exceptions. For example, statistics are still required on base table index columns that have singletable predicates in order to do access path analysis. Similarly, statistics might be required if the covered columns are used as join columns or grouping columns in a GROUP BY clause.
Consider the following query and join index definition:
SELECT * FROM t1, t2 WHERE nusi_col = 10 AND a1 > 20 AND t1.a1 = t2.a2; CREATE JOIN INDEX ji AS SELECT a1, b1 FROM t1 WHERE nusi_col = 10 AND a1 > 20;
In this case, the join index cardinality provides the final selectivity for t1; However, statistics on column a1 might still be required to make join cardinality, rows per value, and other cardinality estimates. Ideally, the statistics on a1 from the join index are more accurate than the base table statistics. The optimizer first attempts to locate the join column a1 in the join index and then attempts to use its statistics. If this column is not projected, or if the statistics are not available, then the optimizer looks for statistics on the base table.
When usable singletable join indexes are found for singletable cardinality estimations, and the columns referenced in the join predicates or GROUP BY clause are projected from the join index, then the Statistics Wizard recommends statistics on the join index columns if they are not available. For the previous example, the Statistics Wizard recommends collecting statistics on ji.a1 since the column a1 is projected in the join index. If this column is not projected, then the Statistics Wizard recommends collecting statistics on the base table columns t1.a1
Column statistics for join indexes and their underlying base tables are not interchangeable. You need to submit separate COLLECT STATISTICS statements for the columns in the join index and the source columns in the base tables. Join index tables and data tables are seen as separate entities to the optimizer.
General Statistics Rules  All NonUnique Secondary Indexes (ALL Option):
Collecting STATISTICS on NUSIs makes it possible for the optimizer to decide when to use the NUSI for access and when to do a full table scan.
Teradata Statistics Wizard recommends collect statistics on all nonunique secondary indexes due to the data skew possibilities described previously.
General  All NonUnique Secondary Index recommendations:
Collect Statistics on lineitem Index (L_PARTKEY, L_QUANTITY, L_SHIPMODE);
General Statistics Rules  All VOSI (Value ordered NUSI):
Valueordered NUSIs are useful for processing range conditions and conditions with an inequality on the secondary index column set.
By sorting the NUSI rows by data value, it is possible to search only a portion of the index subtable for a given range of key values. The major advantage of a valueordered NUSI is in the performance of range queries.
The following example illustrates a valueordered NUSI (defined by an ORDER BY clause that specifies the VALUES keyword option) and a query that would probably be solved more efficiently if the specified valueordered NUSI were selected by the optimizer to access the requested rows.
CREATE INDEX Idx_Date (o_orderdate) ORDER BY VALUES (o_orderdate) ON Orders; SELECT * FROM Orders WHERE o_orderdate BETWEEN ‘19971001’ AND ‘19971007’;
Teradata Statistics Wizard recommends statistics on all VOSIs.
General  All Value Ordered NUSI recommendations
Collect Statistics on lineitem Index (L_SHIPINSTRUCT, L_SUPPKEY, L_LINENUMBER, L_EXTENDEDPRICE , L_LINESTATUS); Collect Statistics on lineitem Index (L_PARTKEY, L_QUANTITY, L_LINENUMBER);
General Statistics Rules  All Partitioned Tables:
If there are any predicates on the partitioning columns of a PPI and the system does partition elimination, or if there are any join predicates on the primary index or the partitioning column of the PPI table, then the Statistics Wizard recommends singlecolumn partition statistics on the systemderived PARTITION column with High Confidence.
If the system does partition elimination, then multicolumn PARTITION Statistics are recommended with Low Confidence on the systemderived PARTITION column and other join columns. This helps to derive the column correlations for the other join columns based on the qualified partitions.
For example, suppose you submit the following query:
SELECT * FROM t1_ppi, t2 WHERE t1_ppi.partition_column BETWEEN 10 AND 20 AND t1_ppi.x1 = t2.y1;
If the qualified partitions are (4, 7) then it is helpful for the optimizer to derive the demographics of the join column t1_ppi.x1 for the given partitions (4, 7) in order to provide more accurate estimation and costing. The recommendations need to be implemented only if the partitioning column and the join columns are correlated.
General Statistics Rules  Search for New Tables/Indexes:
This option will search all the new tables or indexes created after a specified date and generate the recommendations on new tables or newly added indexes.
For example, if the following new table is created after a specified date (i.e. CreateTimeStamp column of DBC.TVM table),
CREATE MULTISET TABLE supplier , NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT ( S_SUPPKEY INTEGER NOT NULL, S_NAME CHAR(25) NOT NULL, S_ADDRESS VARCHAR(40) NOT NULL, S_NATIONKEY INTEGER NOT NULL, S_PHONE CHAR(15) NOT NULL, S_ACCTBAL DECIMAL(15,2) NOT NULL, S_COMMENT VARCHAR(101) NOT NULL) UNIQUE PRIMARY INDEX ( S_SUPPKEY ) INDEX ( S_NATIONKEY ); Teradata Statistics Wizard recommends the following Statistics collection: Collect Statistics on supplier Index (S_SUPPKEY); Collect Statistics on supplier Index (S_NATIONKEY);
Following are the recommendations when Statistics was not collected for any of ‘ProcessOrder’ tables as all tables are newly added after specified date.
Sampled Statistics Recommendations:
Collecting statistics on a sample of the data reduces the resources required and the time to perform statistics collection.
If the columns are not indexed, then the rows are organized randomly, so the system just scans the first n percent it finds, where the value of n is determined by the relative presence or absence of skew in the data. Conceivably, the entire sample could be taken from the first data block on each AMP, depending on the system configuration and cardinality of the table being sampled. Particularly avoid using sampled statistics with skewed columns or indexes. For example, columns or indexes that are unique or nearly unique are uniformly distributed, and a good candidate for sampled statistics. However, sampling should not be considered for data that is highly skewed because the optimizer needs to be fully aware of such skew. In addition to uniformly distributed data, full statistics collection on NUSIs is so fast and efficient that there is not much collection overhead to improve upon going to sampling. Sample statistics for NUSIs is not worth the reduction of information available for the minimal collection overhead saved.
To summarize, sampled statistics are generally most appropriate for:
 Data that is uniformly distributed
 Nearunique or unique primary index columns
 Very large tables, but only when collecting full statistics would be prohibitive
Teradata Statistics Wizard recommends sampled Statistics based on following rules.
 Tables with greater than 1000 rows/AMP
 Unique columns
 Highly unique columns or indexes
 Sample Percentage (Default 2%)
Note: The quality of the statistics collected with sampling is not guaranteed to be as good as the quality of statistics collected on an entire table without sampling, but is most likely better than having no statistics at all. When you use sampled statistics rather than fulltable statistics, you are trading reduced statistics collection time and resources for less accurate statistics.
Sampled Statistics Recommendations Table with greater than 1000 rows/AMP:
Sampled Statistics on a small table must be avoided as small tables (fewer rows/AMPs) may end up with very inaccurate row counts. Sampled Statistics can be recommended on big tables (e.g. Size > 1000 rows/AMP).
Teradata Statistics Wizard recommends sample statistics for all tables with a size greater than 1000 rows/AMP. The default sample size is 2%.
Example: Total number of Amps: Select HashAmp () +1; *** Query completed. 3 rows found. One column returned. *** Total elapsed time was 1 second. (HASHAMP () +1)  2 Ie: The database contains 2 Amps. Total number of rows per Amps Select Count (*)/(Hashamp()+1) from lineitem; *** Query completed. One row found. One column returned. *** Total elapsed time was 1 second. (Count(*)/(HASHAMP()+1))  647
No Sampled recommendations are generated by Teradata Statistics Wizard as the table must contain more than 1000 rows per AMP.
Inserting rows to table Insert 707 rows to the existing table lineitem. Select Count (*)/(Hashamp()+1) from lineitem; *** Query completed. One row found. One column returned. *** Total elapsed time was 1 second. (Count(*)/(HASHAMP()+1))  1001
Now there are more than 1000 rows per AMP.
Sampled Statistics Recommendations:
Collect Statistics using sample on lineitem column (L_ORDERKEY);
Sampled Statistics Recommendations Unique Columns:
Sampled Statistics are very reasonable for columns that are unique or almost unique, so Teradata Statistics Wizard recommends sampled statistics for all unique columns based on unique constrains defined on tables.
Example: Collect Statistics using sample on lineitem columns (L_ORDERKEY); Collect Statistics using sample on lineitem columns (L_PARTKEY);
Sampled Statistics Recommendations Unique Indexes:
Sampled Statistics are very reasonable on unique indexes (like UPI) because of data uniqueness distribution, so Teradata Statistics Wizard recommends sampled statistics on all unique indexes based on unique index attributes on selected data objects.
Example: Collect Statistics using sample on lineitem index (L_ORDERKEY); Collect Statistics using sample on lineitem Index (L_PARTKEY); Note the overlap in the unique column and unique index recommendations. The following will be recommendations if both unique columns and unique indexes are selected: Collect Statistics using sample on lineitem index (L_ORDERKEY); Collect Statistics using sample on lineitem columns (L_PARTKEY);
Conclusion and Summary:
Statistics will always be the most complete when they are collected fully. But due to the time and resources required in the statistics collection effort, it makes sense to consider alternative approaches, if they can provide a similar level of accuracy.
USING SAMPLE Statistics
 Significantly faster than full Statistics collection
 Acceptable for columns or indexes that are “highly unique,” that is, whose number of distinct values approaches the row count of the table
 Not recommended for small tables, that is, tables whose row count is less than 20 times the number of AMPs in the system
When to use:
For columns that are unique, such as unique primary indexes, or for columns or indexes that are close to unique. Some customers, experienced in USING SAMPLE, suggest that USING SAMPLE provides particular value on very large tables, tables with tens of billions of rows.
Full Statistics
 Best choice on columns or indexes when data is moderately or greatly skewed
 Recommended for small tables, that is, tables with fewer than 1000 rows per AMP
 For selection columns whose number of distinct values is moderate or low
 For most NUSI’s and other selection columns
When to use:
For all columns/indexes where collected Statistics add value, and where USING SAMPLE does not provide satisfactory estimates.
Statistics collection is an ongoing job. Because it is a highlyrepetitive activity, it is worth taking some time to investigate the optimal way to combine these different approaches at your site.
Because a mixed approach to statistics collection requires columnbycolumn (or indexbyindex) analysis, it may take some time and effort to apply comprehensively. However, by focusing only on the larger tables, where full statistics collection time is the greatest, significant savings can be experienced with a moderate effort.
Nice overview of the tool and options. I just want to stress to people the importance of the first line in your "when to use" summary. The output of the tool are "recommendations". You must apply, test and either keep or remove the recommended stats. Do not just apply *ALL* recommendations , only the ones that add value.
Dave