Authors: Carrie Ballinger, Rama Krishna Korlapati, Paul Sinclair

Looking for a fresh perspective on collecting statistics, or just want a confirmation that you’re on the right track? Either way, you’ll want to read this quick summary of recommendations for Teradata 12 stats collection.

Statistical information is vital for the optimizer when query plans are built. But collecting statistics can involve time and resources. By understanding and combining several different statistics gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.

This recently-updated compilation of statistics collection recommendations is intended for sites on any of the Teradata 12 software release levels. All of these recommendations apply to releases earlier than Teradata 12, with the exception of Points #4 and #5 under “Other Considerations”. Those two recommendations take advantage of optimizer enhancements that are part of the Teradata 12 release and are not relevant to earlier releases.


Collect Full Statistics 

  • Non-indexed columns used in predicates
  • All NUSIs with an uneven distribution of values * 
  • NUSIs used in join steps
  • USIs/UPIs if used in non-equality predicates (range constraints)
  • Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
  • Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
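
As an illustration, full collection for several of the cases above might look like the statements below, using a hypothetical Orders table (all table and column names here are illustrative, not prescribed):

COLLECT STATISTICS ON Orders COLUMN (order_status);   -- non-indexed column used in predicates
COLLECT STATISTICS ON Orders INDEX (customer_id);     -- a NUSI used in join steps
COLLECT STATISTICS ON Orders COLUMN (order_id);       -- a NUPI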


Can Rely on Random AMP Sampling   

  • USIs or UPIs if only used with equality predicates
  • NUSIs with an even distribution of values
  • NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
  • See “Other Considerations” for additional points related to random AMP sampling


Option to use USING SAMPLE

  • Unique index columns
  • Nearly-unique columns or indexes**
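
For example, sampled collection on a nearly-unique column might look like this, using the same hypothetical Orders table (names are illustrative):

COLLECT STATISTICS USING SAMPLE ON Orders COLUMN (order_number);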


Collect Multicolumn Statistics

  • Groups of columns that often appear together in conditions with equality predicates, if the first 16 bytes of the concatenated column values are sufficiently distinct. These statistics will be used for single-table estimates.
  • Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them.***  With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more correlated the actual combinations of values are, the greater the value of collecting multicolumn statistics will be.
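
For example, if queries frequently join or aggregate on customer state and zip code together (dependent columns, per the footnote below), a multicolumn collection on a hypothetical Customer table would look like:

COLLECT STATISTICS ON Customer COLUMN (state_code, zip_code);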


Other Considerations

1. Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination will not be chosen unless statistics have been collected on the relevant columns.

2. NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far off from being 75% unique (for example, 90% unique, or on the other side, 60% unique or less) will benefit from having statistics collected, including a NUPI composed of multiple columns, regardless of the length of the concatenated values. However, if it is close to being 75% unique, then random AMP samples are adequate. To determine the uniqueness of a NUPI before collecting statistics, you can issue this SQL statement:

EXPLAIN SELECT DISTINCT nupi-column FROM table;
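
The estimated row count in the final step of that explain roughly approximates the number of distinct NUPI values, which you can compare to the table's total row count. If a full-table scan is acceptable, you can also measure uniqueness exactly with a query along these lines (nupi_column and tab are placeholders):

SELECT COUNT(DISTINCT nupi_column) * 100.00 / COUNT(*) AS pct_unique
FROM tab;

A result near 75 suggests random AMP samples are adequate; a result far above or below suggests collecting statistics.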

3. For a partitioned primary index table, it is recommended that you always collect statistics on:

  • PARTITION. This tells the optimizer how many partitions are empty, and how many rows are in each partition. This statistic is used for optimizer costing.
  • The partitioning column. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
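
Using a hypothetical Sales table partitioned by sale_date (names are illustrative), those two collections would be:

COLLECT STATISTICS ON Sales COLUMN PARTITION;
COLLECT STATISTICS ON Sales COLUMN (sale_date);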

4. For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):

  • (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
  • (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey join costing.
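
Continuing the hypothetical Sales example, with store_id as the primary index column and sale_date as the partitioning column, these would be:

COLLECT STATISTICS ON Sales COLUMN (PARTITION, store_id);
COLLECT STATISTICS ON Sales COLUMN (PARTITION, store_id, sale_date);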

5. Random AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). All-AMP random AMP sampling has these particular advantages:

  • It provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
  • Statistics extrapolation for any column in a table will not be attempted for small tables or tables whose primary index is skewed (based on full statistics having been collected on the PI), unless all-AMP random AMP sampling is turned on. Because a random AMP sample is compared against the table row count in the histogram as the first step in the extrapolation process, an accurate random AMP sample row count is critical for determining if collected statistics are stale, or not.

 

 * Uneven distribution exists when the High Mode Frequency (ModeFrequency column in interval zero) in the histogram is greater than the average rows-per-value (RPV) by a factor of 4 or more.  RPV is calculated as Number of Rows / Number of Uniques.
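
If statistics have not yet been collected and you want to gauge this from the data itself, queries along these lines approximate the two quantities (tab and nusi_column are placeholders): the first returns the highest frequency of any single value (the data-level analog of High Mode Frequency), and the second returns the average rows-per-value.

SELECT MAX(cnt) AS high_mode_frequency
FROM (SELECT nusi_column, COUNT(*) AS cnt
      FROM tab
      GROUP BY 1) AS freq;

SELECT COUNT(*) * 1.00 / COUNT(DISTINCT nusi_column) AS rows_per_value
FROM tab;

If the first result is 4 or more times the second, the distribution is uneven by this rule of thumb.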

** Any column that is over 95% unique is considered a nearly-unique column.

*** Correlated columns within a multicolumn statistic are columns where the value in one may influence, or predict, the value in the other. For example, in a nation table there is a tight correlation between nationkey and nationname. In a customer table there might be a correlation, although a somewhat looser one, between customer zip code and customer income band.

Dependent columns are columns where the value in one column tends to directly influence the value in the second column. An example of a dependent column could be customer zip code, which is dependent on customer state; if they both appeared in a multicolumn statistic there would be a dependency between them. Another pair with some degree of dependency might be job title, which is sometimes dependent on industry segment, if both were in a multicolumn stat.
 

Discussion
mnstr 1 comment Joined 01/09
17 Jun 2009

Nice recommendations!! Thanks a lot :D

Srividhya80_b 2 comments Joined 07/05
24 Jun 2009

With regard to multicolumn stats, can you please illustrate "where there is either a dependency or some degree of correlation among them"?
What kind of dependency is referred to here? Can you please explain?

Regards Srividhya

carrie 595 comments Joined 04/08
25 Jun 2009

I have incorporated the following response, and a few other minor updates, into the original article.

Response:

Correlated columns within a multicolumn statistic are columns where the value in one may influence, or predict, the value in the other. For example, in a nation table there is a tight correlation between nationkey and nationname. In a customer table there might be a correlation, although a somewhat looser one, between customer zip code and customer income band.

Dependent columns are columns where the value in one column tends to directly influence the value in the second column. An example of a dependent column could be customer zip code, which is dependent on customer state; if they both appeared in a multicolumn statistic there would be a dependency between them. Another pair with some degree of dependency might be job title, which is sometimes dependent on industry segment, if both were in a multicolumn stat.

leo.issac 1 comment Joined 07/06
12 Aug 2009

First things first: this is a very interesting platform for Teradata developers to share their thoughts about Teradata.

Now coming to my question: suppose I have a query which retrieves records from a single table based on conditions in the WHERE clause. If we assume the WHERE clause conditions are based on primary index columns, non-indexed columns, and PPI columns, how does collecting stats help the query? I assume this would only help in giving the estimates; however, for this type of query, no matter what the stats are on this table, the end-to-end run time should be the same for a single-table retrieval. Can you enlighten me on this?

carrie 595 comments Joined 04/08
13 Aug 2009

Good question.

If the access of a single table is by primary index in an equality condition, statistics are not used and this will be a single-AMP operation, with no table scan.

If the access of single table by primary index uses a range constraint, then statistics on the PI will be used to produce a reasonable time and row count estimate, but the plan, since it only uses a single table, will not change.

The accuracy of these types of estimates could be important because TASM can use estimated processing time to classify queries to different workloads that have different priorities. This makes the accuracy of statistics important, even when they do not change the plan, and this classification could impact the elapsed time of the query.

In addition, there is an option on system/object throttles called a step time threshold. This option allows work to run immediately (and never be delayed) if all the steps in the query have estimated processing times that are less than the specified step time threshold. This is a good technique to use when the throttle manages both short work and long work and you don't want the short work delayed. So accurate estimates are important for this feature to work as intended.

The same is true for collecting statistics on the partitioning column and on PARTITION. They may not change the plan if only one table is involved, but they allow for more accurate statistics for the purposes of TASM classification.

tdusr 1 comment Joined 09/09
24 Sep 2009

What is the difference in the plan when stats are collected only on PARTITION or only on the PPI column, if the query uses a join on the PI and PPI with single-table predicates on both PPI columns?

carrie 595 comments Joined 04/08
30 Sep 2009

It sounds like this join you are describing is between 2 PPI tables, both partitioned identically, and the join is on both the PI and the partitioning columns of both tables. In this case the partitioning column stats (on both tables) will be used to provide estimated time and row counts for the step(s) that access the tables. You may not get a plan change without partitioning column stats, but your estimates will be more accurate with those stats. That could be important for TASM classification purposes, or if there is a step time threshold on an object throttle associated with this query.

In the case of PARTITION, collecting or not collecting on this system-derived column will not change estimated row counts, and is unlikely to change the plan of such a simple query as you describe. However, in some other cases a less optimal join plan may be selected if there are no stats on PARTITION, and since collecting stats on PARTITION is such a quick operation to perform, it may help other more complex queries in the future if these statistics are available.

In the absence of PARTITION statistics, the optimizer assumes rows are distributed evenly among all partitions. If this is not the case, you could end up with incorrect IO costing. This may affect single-table access paths when there are secondary indexes and/or it could affect the join order.

Karam 25 comments Joined 07/09
08 Oct 2009

Hello Carrie...

I'm a big fan of your postings. Nice to find you here. It will surely bring quality to the forum.
I have a basic question: which table columns can go without collecting statistics? In my present engagement, I have seen the trend of collecting statistics on all columns and combinations of columns. This in turn increases the time taken for statistics collection.

carrie 595 comments Joined 04/08
08 Oct 2009

Thank you for your kind words, and for taking time to add a comment.

The most common place I see unneeded statistics collected is with multicolumn stats. This is especially true if they are lengthy (greater than 16 bytes) and if their first 16 bytes don't offer much differentiation. I think I have a blog posting from July that discusses dropping multicolumn stats under some conditions, so you might want to find that.

One of the reasons these are first-line candidates for dropping is that in order for them to be used by the optimizer for single-table selection purposes, each column must be expressed in an equality condition. If most of your queries use these columns in a combination of equality and range selection, or don't use the columns together, the statistics are not providing you value.

If the site has been relying on HELPSTATS output to decide what to collect on, that feature goes overboard when it makes its recommendations for multicolumn stats, particularly in the V2R6.2 and earlier releases. In Teradata 12, HELPSTATS does a better scoping job, and you won't see so many extreme recommendations.

For unique or near-unique columns, try using USING SAMPLE. It's a big savings, even if you can only apply it to a few columns here and there.

DeepthiYella 1 comment Joined 11/09
03 Mar 2010

hi carrie

How are you? I need your help: I am new to the Teradata database. Will you explain the Teradata architecture, the basics, and Teradata SQL, with tips and tricks, for certification as well as for interview purposes?

Thanks & Regards
Deepthi

carrie 595 comments Joined 04/08
09 Mar 2010

To learn about Teradata, see the courses offered via the Teradata Education Network at www.teradata.com/t/TEN.

To learn about the Teradata Certified Professional Program, see the TCPP website at www.teradata.com/Certification.

duddukuru 1 comment Joined 04/10
13 Apr 2010

Hi Carrie,
I find this information very useful. I have tested all the recommendations and they worked!
Just want to check with you whether I can share this information with my team/company so that they can also use it effectively and unleash the power of Teradata. Thank you.

Regards,
Hari

Aa 1 comment Joined 05/10
11 May 2010

Hi. Just as an EXPLAIN can be used to get recommended stats on table columns, is there any way that index recommendations can be obtained too?

carrie 595 comments Joined 04/08
11 May 2010

Hari,

Please share this information with your team if you think it will be useful to them. I am glad to hear you found it helpful. Thanks so much, -Carrie

carrie 595 comments Joined 04/08
11 May 2010

Aa,

There is no method that I am aware of for getting index recommendations out of the explain. There is a tool that is part of the Teradata Analyst Pack that you could use for that purpose. It is called the Index Wizard, and there is a separate manual in the official Teradata documentation called "Teradata Index Wizard User Guide."

The Index Wizard requires that you select a workload (a defined set of queries), and it performs index analysis on that workload for you and makes suggestions to you. You might want to check that out.

Thanks, -Carrie

JackSparrow 1 comment Joined 06/10
13 Jun 2010

Hi Carrie,

I have a question regarding stat collection, where query performance degrades after stat collection.

I have a partitioned primary index table where the partitioning column is not part of the table’s primary index (PI). Every day around 1 million records are added to this table with duplicate PI values, and we keep 1 year of data in the table (for testing I have around 500 million records and 194 partitions).

I have a SQL statement in which I am using the above table and joining it to another table (this table has the same field as a NUPI and the values are unique), joining both tables on the PI columns. In the WHERE clause I have a date filter on the partitioning column. Stats are collected on the second table but not on the partitioned table. The SQL takes around 1.2 CPU secs; the explain plan shows around 0.70.
Now if I collect stats on PARTITION and the PI, the same query takes 3+ CPU secs, and the explain plan is the same except:

No Stats Collected
...estimated with no confidence to be 4,245 rows(853,245 bytes). The estimated time for this step is 0.08 secs

after stat collection
...estimated with no confidence to be 6,653 rows(1,337,253 bytes). The estimated time for this step is 0.25 secs

Similarly, the number of rows and the time increase for some of the steps in the explain plan after I collect stats. My question is why this is happening. Sorry, I can't post the SQL or plan for certain reasons.

carrie 595 comments Joined 04/08
15 Jun 2010

Hi Jack,

The fact that you are getting "no confidence" in your explain text AFTER collecting statistics makes me think that you might have missed some important statistics. In addition to collecting stats on PARTITION and the PI, it is always important with PPI tables to collect statistics on the partitioning column. Was the partitioning column included in your stats collections? Also make sure that the join columns of both tables have statistics. You may also want to consider the multicolumn statistics specific to PPI tables that are mentioned in the last part of the statistics recommendations above, specifically:

(PARTITION, PI)
(PARTITION, PI, partitioning column)

Both of the above statistics can contribute to improved join costing, particularly on PPI tables with NUPIs.

Thanks, -Carrie

RobertDiehl 2 comments Joined 05/10
14 Sep 2010

Carrie,

If you are collecting on (PARTITION, PI) or (PARTITION, PI, partitioning column), can you use sampled stats?

Thanks,

Bob Diehl

carrie 595 comments Joined 04/08
15 Sep 2010

Hi Bob,

You may use sampled statistics on primary indexes. In fact, if the PI is unique, it's a recommended approach, because sampling will always be pretty accurate with unique data. If it's non-unique and lumpy, or is less than 95% unique, it is better to use full stats.

The system will not let you collect sampled stats on partitioning columns. You will get an error if you attempt to do that (try it and see). That is because the data in a PPI table is stored physically by partitioning column first, so if you were to sample 2% of a table, the first 2% could all be coming from the first partition, and it would appear that all, or most, partitioning column values were identical. So it is not allowed for that reason: the potential for misleading results.

For single-column PARTITION stats, the sampling percent will automatically be raised to 100%, which means full stats. The same recommendation applies (collect only full stats) for multicolumn stats that include PARTITION. But the database doesn't automatically raise the sample percent to 100% in those cases, at least not in current software. There may be a change to that in future releases.

So the answer is yes, no, and sort of.

Thanks, -Carrie

Toad 1 comment Joined 12/08
25 Nov 2010

Hi Carrie,
I have a PPI table where the PI (ColA) is unique, but the partitioning columns (ColB and ColC) are not part of the PI.
Based on your recommendations above, is my understanding below right?
COLLECT STATS ON TABLE INDEX (Col A);
COLLECT STATS ON TABLE COLUMN PARTITION;
COLLECT STATS ON TABLE COLUMN ColB;
COLLECT STATS ON TABLE COLUMN ColC;
COLLECT STATS ON TABLE COLUMN (PARTITION, ColA, ColB);
COLLECT STATS ON TABLE COLUMN (PARTITION, ColA, ColC);

Thanks,
Toad

abdul 2 comments Joined 03/11
25 Mar 2011

Hi carrie,
Can you explain the retrieval of data from a join index table versus the base table?
When I create a join index, how is it stored in the data dictionary? Can you explain?
Can I use a SELECT statement on a join index table?

Thanks
Abdul

carrie 595 comments Joined 04/08
28 Mar 2011

Hi Abdul,

Data is retrieved from a join index the same as from a base table, either using primary index access, or using partition elimination (if PPI has been defined) or doing a table scan. Also, you can build a NUSI on a join index, so you could also have NUSI access as well.

Detail about the join index is held in these data dictionary tables: tvm, tvfields, and indexes. These are the same dictionary tables as are used to store the definition of a base table (but with a different table-kind code).
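
As a quick way to list the join indexes on a system, the DBC.Tables view (which sits over tvm) can be queried; the table-kind code 'I' identifies a join index. A simple sketch:

SELECT DatabaseName, TableName
FROM DBC.Tables
WHERE TableKind = 'I';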

You cannot directly access a join index. It is always a decision made by the optimizer whether or not to access a join index.

You can read more about join indexes, how they work, and how to use them, in the Database Design Manual, accessible at: http://www.info.teradata.com

carrie 595 comments Joined 04/08
28 Mar 2011

Sorry, Toad, I totally missed your earlier comment.

Yes, that looks good in terms of stats to collect on, with one exception.

If this is a single-level PPI table and the partitioning columns include (ColB, ColC) as a multicolumn set, then you want to collect stats on them as a multicolumn stat. If they are part of different partitioning column definitions on different levels of a multi-level PPI table, then collect on them separately.

You may not actually need all those stats, but if you are able to collect them, it will cover a lot of different possibilities where stats could be useful with a PPI table.

Thanks, -Carrie

khachla 1 comment Joined 07/10
29 Mar 2011

Hi Carrie,

I have a question concerning the case of a partitioned primary index table where the partitioning column is not part of the primary index, as you mentioned in Points 3 and 4. The following stats should be collected:
. PARTITION
. partitioning column
. (PARTITION, PI)
. (PARTITION, PI, partitioning column)

Is it necessary to collect stats on the PI if we already collect stats on (PARTITION, PI)?

In general, if we need to collect stats on another column (e.g., column) in the PPI case, should we collect stats only on (column), on (PARTITION, column), or both? Logically, I think that when we already have statistics on (PARTITION, column), we already have stats on that column, so there is no need to also collect stats on (column). Am I right?

tanishq 1 comment Joined 07/11
06 Jul 2011

Hi Carrie,

How can we find the unused statistics of a table?

E.g., if I have one table, tableA, with statistics collected on 5 columns (col1, col2, col3, col4 & col5) out of 10 total columns in the table: when I execute a query on col4 or col5 by specifying some criteria, my query does not use the statistics collected on those 2 columns. In that way, I would like to know how many statistics have been collected but are never used by any of the queries. Please treat this as urgent; thanks in advance.

carrie 595 comments Joined 04/08
08 Jul 2011

There is no way that I am aware of for determining which statistics are actually used by the optimizer when building a query plan. You have to use your best judgement on which statistics to collect. Once you collect statistics, if the table undergoes growth, you may need to recollect statistics on those same columns. Stale statistics may be worse than no statistics at all.

In general, the recommendations in the above article are the best suggestions we have to guide you in that process.

Thanks, -Carrie

irfan098 7 comments Joined 07/11
13 Jul 2011

Hi Carrie,
My question is about stats collection after an upgrade from TD12 to TD13.10. I have stats collected on the tables (at column, index, and multicolumn level) in TD12. After the upgrade to TD13.10, do we only need to recollect stats at the table level, i.e.:
collect stats on databasename.tablename ;

or do I need to collect stats at the column and index level as well?

carrie 595 comments Joined 04/08
15 Jul 2011

Hello Irfan.

You only need to re-collect statistics at the table level, if what you want is for all previously-collected statistics to be refreshed. When you recollect at the table level, all current stats for that table will be recollected, whether they are at the column or index level. The table level collection will also remember if you have collected statistics with the USING SAMPLE option. Basically, it looks at all the statistics histograms that exist for that table and rebuilds them all.

Thanks, -Carrie

irfan098 7 comments Joined 07/11
16 Jul 2011

Thanks a lot, Carrie!

jainayush007 8 comments Joined 03/11
30 Jul 2011

Hi Carrie,
Thanks for such an informative post.
We discussed what to collect and when to collect it, as well as its various forms. I would be interested to know which ones should NOT be collected. One of my friends recently suggested not collecting stats on a highly non-unique column. Are there any such recommendations as well?

Thanks.

31 Jan 2012

Hi Carrie,

It's a great article.

I am in the process of creating a stats collection process. I am targeting the thresholds below for deciding whether any table requires stats collection, i.e.:

1) Small tables (any growth): collect statistics
2) Large tables (growth > 10%): collect statistics
3) Otherwise, collect stats on PARTITION

Now my question: if my threshold values are correct, then in the case of a large partitioned table, will this still be valid? There can be a scenario where a 2% increase in growth is recorded, but all in a new partition. In that case, do we need to collect stats on PARTITION only, or also on the partitioning columns?

Also, a stupid question: how can we find partitioning column information from DBC without parsing the DDL?

Thanks

carrie 595 comments Joined 04/08
01 Feb 2012

Laeeq,

You are correct about the re-collection rules being somewhat different for partitioned tables (PPI). If growth is greater than 10% in any one partition, it is advisable to recollect statistics on both PARTITION and partitioning column.

Prior to Teradata 14.0, the only way to get partitioning column information is by parsing the DDL. In Teradata 14.0, however, there is a column "PartitioningColumn" carried in TVFields (it can be seen in the Columns view). It has a value of Y if the column is a partitioning column. If a second column "ColumnPartitionNumber" is zero, that indicates the column is a partitioning column for a row-partitioned table (as opposed to a column-partitioned table partition).

Thanks, -Carrie

dnoeth 86 comments Joined 11/04
01 Feb 2012

Hi Carrie/Laeeq,
instead of parsing the table DDL, partitioning columns can be extracted from dbc.TableConstraints, too.

See "PartitioningColumn" in my new stats query:
http://developer.teradata.com/blog/dnoeth/2011/12/how-to-decode-the-binary-statistics-stored-in-dbc-tables

The number of partitions is also returned: "NumValues" for PARTITION stats ("StatsType"='Part')

Dieter

carrie 595 comments Joined 04/08
02 Feb 2012

Thanks, Dieter!

22 Feb 2012

Thanks Dieter/Carrie!!

G_Smith 1 comment Joined 04/12
09 Apr 2012

Hello Carrie,
In the case of a large partitioned table, after a change of 10%, I need to re-collect stats on PARTITION and the partitioning column.
My question: at this point (after this 10% change), do I need to re-collect on the other columns of this table that have statistics? Or does it depend on other factors for each column, such as some individual characteristic?

carrie 595 comments Joined 04/08
11 Apr 2012

You do not necessarily need to recollect statistics on columns unrelated to partitioning at the same time as you recollect on PARTITION and the partitioning columns. However, if the entire table has changed by 10%, it is usually advisable to do so. It depends on how growth has impacted the current statistics for those columns.

Generally, with PARTITION and partitioning column(s) you want to recollect if any INDIVIDUAL PARTITION has changed by 10%, rather than waiting until the entire table has changed by 10%. In that case you could probably wait until the table has changed by 10% to do the other recollections.

Just a reminder: if you have multicolumn statistics that include either PARTITION or the partitioning column, they need to be recollected when the number of rows in any one partition changes by 10%.
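
A simple way to watch per-partition change over time is to count rows by the system-derived PARTITION column on the PPI table and compare the counts between recollections (tablename is a placeholder):

SELECT PARTITION, COUNT(*) AS row_count
FROM tablename
GROUP BY 1
ORDER BY 1;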

Thanks, -Carrie

Shrey Takkar 4 comments Joined 06/12
30 Jun 2012

Hello Carrie,

Thanks a lot for providing a quick summary for statistics collection recommendations.

I have the following doubts about statistics collection using the Teradata Statistics Wizard:

1. What are stale statistics?

2. Is it advisable to collect statistics as per the recommendations for a particular workload, or for all tables which are part of the workload?

3. If I define a workload using "Create from SQL statement" and then get recommendations for collecting statistics on that particular workload, the recommendations would include statistics collection on columns/indexes for the query mentioned in the workload. Here it might happen that statistics are collected on some columns/indexes which may not be required for other queries. In other words, would the statistics collected for a particular query be stale for other queries?

4. Is it advisable to collect statistics for all tables that are frequently used regularly?

Thanks,
Shrey

carrie 595 comments Joined 04/08
03 Jul 2012

Shrey, My responses are below your questions:

1. What are stale statistics?

Stale statistics come about when statistics have been collected in the past, but the table has grown since the stats were collected, so the statistics are no longer accurate. Only tables that undergo growth without statistics recollections will have stale statistics.

2. Is it advisable to collect statistics as per the recommendations for a particular workload, or for all tables which are part of the workload?

These are general recommendations to be applied depending on need. They are recommendations at the table level.

3. If I define a workload using "Create from SQL statement" and then get recommendations for collecting statistics on that particular workload, the recommendations would include statistics collection on columns/indexes for the query mentioned in the workload. Here it might happen that statistics are collected on some columns/indexes which may not be required for other queries. In other words, would the statistics collected for a particular query be stale for other queries?

No. Statistics are on the table. All queries that access that table will use those statistics if they need them. If the table has increased its row count, then its statistics may be stale. The optimizer will recognize that the statistics are stale for all queries that access the table and require those statistics. The concept of "staleness" is not query by query. A particular stale statistic may not be used for a given query against that table, in which case it will not matter that it is stale. A stale statistic only matters if a given query is making use of the column(s) the statistic represents.

4. Is it advisable to collect statistics for all tables that are frequently used regularly?

Yes. But decisions about which statistics are appropriate should be based on query access patterns.

There are several orange books on the topic of statistics collection that are available in the orange book repository if you require additional information.

Thanks, -Carrie

Shrey Takkar 4 comments Joined 06/12
08 Jul 2012

Thanks a lot Carrie for your detailed response. It was of great help.
Please clarify the following query:

For example, I have two queries involving complex joins. Query 1 is dependent on tables A, B, C and D, whereas Query 2 is dependent on tables B, C and E.

Tables B and C are common to both. But the join columns used (of tables B and C) in Query 1 are different from the join columns used (of tables B and C) in Query 2.

If we get recommendations using the Teradata Statistics Wizard for Query 1, they may be different for Query 2. So on executing the recommendations for each query, we end up with statistics being collected on tables B and C on columns which are present in the join conditions of Query 1 but not of Query 2. In other words, statistics collected for Query 1 won't be useful for Query 2, or vice versa. Going forward we may have statistics collected on almost every column in the table.

Should we collect statistics for columns involved in joins?

Would the same affect the performance of the query?

Or can we just collect statistics on all tables in production that are loaded frequently, irrespective of the queries being fired in the database? Because one way or another, the queries would only fetch data from tables that have statistics collected on them.

Thanks
Shrey

carrie 595 comments Joined 04/08
11 Jul 2012

In terms of join columns, if these are single-column join constraints, it is a good idea to collect statistics if the column is not unique. If the single join column is defined as a unique column, statistics are not necessary.

Whether collecting statistics on a given column will impact the performance of a query, I would not be able to say for sure. I would not expect additional statistic collection to contribute to degradation. The more information you provide the optimizer, the better. You will have to try it out and decide for yourself if it is a useful statistic to collect.

Statistics are only used when queries access the tables. The purpose of statistics is to help the optimizer produce better query plans. If you have tables that no queries access, there will be little benefit in collecting statistics on those tables.

Thanks, -Carrie

SANJI 9 comments Joined 08/10
23 Nov 2012

What happens to the validity of the stats if a single-column index is dropped (considering that stats were collected on the index)?
Internally, the single-column index's stats are stored as column stats and are not removed.
Do these get invalidated?
Thanks

carrie 595 comments Joined 04/08
26 Nov 2012

Hi Sanji,

An index can be dropped where single-column stats have been collected, and the statistics will remain, still valid and usable.

Collecting statistics on an INDEX versus the COLUMN that makes up the index is just a syntax difference. The collection and storage of the stats information is the same, and the impact on optimizer cost estimation is the same.

If you collect stats on a single column OR a single column index, the stats are stored in the DBC.TVFields row for the column.
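
For example, for a single-column index, these two statements collect and store the same histogram (tablename and col are placeholders):

COLLECT STATISTICS ON tablename INDEX (col);
COLLECT STATISTICS ON tablename COLUMN (col);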

However, if you collect while the index exists, Teradata may choose to scan the index rather than the table to do the COLLECT, regardless of whether COLUMN or INDEX was specified, even if you used the COLUMN keyword in the COLLECT. That can reduce the time needed to recollect.

Thanks, -Carrie

SANJI 9 comments Joined 08/10
26 Nov 2012

Thanks.. Appreciate your response.

VP186001 4 comments Joined 11/12
12 Dec 2012

Carrie,

I've read everywhere that collecting STATS on skewed columns is a must if they're being queried against.
In my project, I've seen that if I suppress the STATS (by using a function on that column) on the skewed column(s), the queries normally perform better.
Can you please explain this phenomenon?

carrie 595 comments Joined 04/08
13 Dec 2012

The only time I have seen a situation where a statistic does not help a query plan, but actually results in a worse query plan, is if the length of the statistic exceeds 16 bytes and the leading 16 bytes are not very distinct. In a case such as that, it is sometimes better to go without the statistic.

Once you get on 14.0, there are options that will allow you to specify values in the histogram to be larger than 16 bytes, and then the statistics will be more useful.

If your statistic length is less than 16 bytes and the plan is worse without the statistic than with the statistic, and nothing else has changed before and after, then I would open an incident with the support center on this so the circumstances around this can be examined in detail.

amittera 10 comments Joined 12/09
25 Apr 2013

Hi Carrie,
 
We have TD12, and I do not have any stats defined on the tables below, which I got as recommendations from blogs:

DBC.TVM
DBC.DBase
DBC.TVFields
DBC.AccessRights
DBC.Indexes
DBC.Profiles
DBC.Owners
DBC.Roles
DBC.RoleGrants
DBC.UDFInfo

We do not have any PDCR in place, and we just purge dbqlogtbl on a monthly basis with a simple insert-select. In discussion, I got the notion that if we are really not querying the DBC tables, specifically dbqlogtbl, then we do not need these stats. I have seen in big DWs that those stats are collected on a daily basis, but for our system, which is not that big, the stats were not defined and hence are not being collected. Can you please suggest whether we should define the stats and then set up a process for collecting them?

Also, are there any side effects of defining and collecting these stats even if we are not querying the tables very often?

Amit Saxena
Teradata Consultant

Smarak0604 15 comments Joined 08/12
25 Apr 2013

Hello Carrie,
 
My questions are related to your "Dropping MC Stats" article and David Roth's similar article.

(a) You mentioned that the data content's length matters, not the column length. Meaning, say stats are collected on (Col_1, Col_2), where Col_1 is VARCHAR(30) and Col_2 is INTEGER. What matters is whether the data in Col_1 exceeds 16 bytes or not, rather than the column length (which is already more than 16 bytes at VARCHAR(30)).
In which case, ('ABCDEFHGIJKLMNOPQ', 14) and ('ABCD', 14) are different.

(b) Stats(Col_1) and Stats(Col_1, Col_2) exist, with Col_1 nearly unique (>= 95%). David recommends dropping Stats(Col_1, Col_2) based on an explain captured on "SELECT DISTINCT Col_1, Col_2 FROM tablename", which is a single-table operation.

As per you, if these two columns are used in joining conditions or aggregate operations, the stats would be required for the unique-values calculation, which is independent of truncation.

So, is the approach of dropping stats based on David's experiment alone (a single-table experiment) justified? Because if both David and you are correct, dropping an MC stat would require David's approach PLUS digging through the SQL to verify that those columns are not used in joining or aggregation operations, which is an extremely difficult job.

(c) Assuming that Stats(Col_1, Col_2, Col_3) exist and all three columns are not used in joining or aggregate operations, then the preferential order of approach for dropping the MC stats would be:
(1) If the first 16 bytes are the same for (Col_1, Col_2, Col_3), drop the stats.
(2) If Unique(Col_1) >= say, 95%, drop Stats(Col_1, Col_2, Col_3) (collect on Col_1 instead).

Meaning, first I check (1) and then (2).

amittera 10 comments Joined 12/09
26 Apr 2013

Is there any query that can identify which defined MC stats exceed the 16-byte limit and can be dropped :) at the database or table level? Also, in case we find some MC stats that can be dropped, how can we analyze the benefits, i.e., CPU/IO, etc.?
 

Amit Saxena
Teradata Consultant

carrie 595 comments Joined 04/08
29 Apr 2013

Amit,
I have an earlier posting on when to collect statistics on dictionary tables. 
 
http://developer.teradata.com/blog/carrie/2010/05/if-you-re-not-collecting-statistics-on-your-dictionary-tables-do-it-now
 
If you are not accessing those DBC tables below, then you do not need stats on them. Sometimes third-party tools access those tables for metadata purposes. You can see if that is happening by looking at DBQL output. If it is, then stats may be helpful in producing better query plans. The blog posting above has more detail.
 
Thanks, -Carrie
 

carrie 595 comments Joined 04/08
29 Apr 2013

On the question of dropping multicolumn stats from Smarak0604:
Question a: Yes, that is true. If most of the time the value in a leading larger varchar is 16 bytes or less, there is probably no need to drop the MC stats.
 
Question b:  That makes sense to me, but I'd look at a few explains that might be using those stats before and after dropping the MC stats.
 
Question c1:  Yes, makes sense.
 
Question c2:  Probably OK, but I'd check the explain before and after.
 
Thanks, -Carrie

carrie 595 comments Joined 04/08
29 Apr 2013

Amit,
 
I don't know of any query that can identify which multicolumn stats are greater than 16 bytes, although it's possible someone has come up with one. However, just because an MC stat is greater than 16 bytes does not mean it should be dropped. There are a couple of reasons the statistic could be providing value:
 
- Even if the number of bytes is greater than 16, the number of distinct values will be correctly calculated, and in the case of the statistic being used in a join, information about the number of distinct values is valuable to the optimizer.
 
- It is only in the "value" fields of the histogram where the truncation to 16 bytes takes place. If the first 16 bytes are sufficiently distinct, then they may still provide value for query predicate purposes.
 
To understand whether or not the stat is providing value, you could look at the explain before and after dropping it for queries that might change due to that stat, and also execute those queries before and after and look at the detail in DBQL, such as resource usage.
 
Thanks, -Carrie

amittera 10 comments Joined 12/09
30 Apr 2013

Thanks Carrie.

Amit Saxena
Teradata Consultant

Ramaiah_ch 1 comment Joined 11/12
30 Apr 2013

Hi Carrie,
How does the optimizer's extrapolation technique work?
Can you please explain it to us?
 
Regards
Ramaiah ch.
 

carrie 595 comments Joined 04/08
03 May 2013

Extrapolation for the 12.0 and the 13.0 releases is discussed in an earlier blog posting of mine here on Developer Exchange.  It is titled:
 
Why collect PARTITION statistics on Non-Partitioned tables?
 
However, be aware that that posting describes how extrapolation is performed in the 13.0 and 13.10 releases only. With every major release we make some improvements to the process.
Thanks, -Carrie

mohdmufid4u 4 comments Joined 10/10
18 Sep 2013

Hi Carrie,
I am a big fan of yours and find your material very helpful; I've watched all your YouTube video sessions on workload management :)  Thanks for helping all of us.
These questions are for the Teradata 12 database.
One of the major performance issues we are facing these days is the stats engine consuming over 1,000 TPerfs daily.
This month, in just 16 days, we have already spent 16,000 TPerfs.
 
  • Is there a way, by SQL, that we can find the unwanted stats, especially on big tables, and how reliable would that be?
  • Can sample stats be collected on a table with skew of up to 0 to 5% on larger tables? If yes, only on indexes, or on all defined candidates? We have quite a few tables which have zero skew and are > 500 GB in size, with a combination of 10 more stats defined; can all these tables be converted to sample stats?
  • Currently, as per the Teradata recommendation, we have a setting in the stats engine to collect stats whenever data changes by 10%. Can this be increased to, say, 20% or more to reduce the frequency of stats collection, especially on larger tables? If yes, by how much can we increase it?
  • Do we really need to collect stats every month on tables like DBQL_HST etc. on all columns already defined as stats candidates? We spend a handy number of TPerfs on stats collection on DBQL; I understand that partition stats should always be refreshed, else it may result in costly product joins.

carrie 595 comments Joined 04/08
24 Sep 2013

The AutoStats feature in 14.10 will be able to identify unused stats. I don't know of any way in SQL to currently find that out, as there is no data stored in any database table that will tell you which stats were used or not used by the optimizer. In 14.10 you will be able to collect use counts for objects such as statistics (a new capability), which will indicate how often, and when the last time was that the optimizer used a given statistic.
 
How much skew you can tolerate and still do sampling depends entirely on whether or not you still are getting good plans with sampling.  I don't know of any strict rule about when to avoid sampling, but in general the more distinct the data, and the more evenly distributed it is, the more you can safely rely on sampling.
 
Frequency of recollection is a judgement call and will depend on whether query plans are still good if you extend recollection thresholds.  10% change is a general guideline.  However, if you are still on 12.0, then the extrapolation capabilities are not as mature as they are on later releases, so you are more at risk if you do not recollect in the recommended way.  
 
Any data that is not changing may not require stats recollections. Also, for tables that are infrequently accessed, you could decide it is of less value to spend much time recollecting stats, since so few queries run against them.
 
Thanks, -Carrie

sunnysteve 7 comments Joined 10/13
15 Apr 2014

Hi Carrie,
I have a transaction table that has a calenderdate column on which statistics have not been recently collected, and I am submitting a range-based query that specifies a date beyond the upper bound of the statistics collected for calenderdate... So what could be true about the optimizer's extrapolation technique?

carrie 595 comments Joined 04/08
21 Apr 2014

When data distribution is relatively uniform, extrapolation can offer reasonably accurate estimates when statistics are out of date. When stale statistics are detected, upward adjustments are made to:
- Table row count
- Number of distinct values
- Number of rows estimated for single table selection
 
In Teradata 12, the optimizer detects stale statistics by looking at table row counts. It compares table row counts from random AMP samples against row counts in collected statistics histograms. Only table growth is detected; shrinkage is not.
 
In Teradata 12, all-AMP random AMP sampling is recommended if the table’s primary index is skewed.
 
If you have not added rows to the table (so the row counts are the same when you last collected stats and by random AMP sample at the time the query runs), then extrapolation will not do anything.  If the range based query asks for rows that don't exist, and have not been extrapolated to exist, then the optimizer will assume zero rows in that range.  The optimizer estimate will only change if extrapolation happens, and extrapolation only happens if the optimizer detects an upward change in the total number of rows in the table.
 
For more information on the detail of extrapolation, see the orange book Optimizer Cardinality Estimation Improvements Teradata Database 12.0 by Rama Krishna Korlapati.
 
For improvements to the optimizer's ability to extrapolate, see these later orange books:
 
Teradata 14.0 Statistics Enhancements (December 2011) and Teradata Database 14.10 Statistics Enhancements (October 2013), both by the same author as above.
 
Thanks, -Carrie
