Accurately identifying how much a table has grown since statistics on a column were last collected is an important prerequisite for the statistics extrapolation process.  Once table growth has been detected, then stale statistics can be adjusted accordingly.  Extrapolations based on table growth will be be more accurate if statistics are being regularly collected on the system-derived column PARTITON, whether or not a table is partitioned.

As mentioned in the earlier posting of mine on this topic, a non-partitioned primary index (NPPI) table is treated as if it were just one big partition with a partition number of 0.   When PARTITION statistics are collected, whether on a PPI or a NPPI table, the cylinder index is scanned in order to establish the number of rows per partition.  In the case of a NPPI table, the number of rows in partition 0 will the same as the table row count.  Collecting statistics on PARTITON is a very fast operation because base table rows are not touched.

When the data distribution of a table is relatively uniform, extrapolation can offer reasonably accurate row count estimates when a histogram is outdated.  If stale data along with table growth is detected, extrapolation will make an upward adjustment to these values:

  • Table row count
  • Number of distinct values
  • Now of rows estimated for single table selection

The enhanced table growth detection mechanism where PARTITION statistics can make a difference requires that version 4 type histograms are being used.  Version 4 of the statistics histogram has been in use from Teradata and Teradata releases, and in all later releases. 

With Version 4 histograms, if a statistic being collected is on the PI or system-derived column PARTITION, then a special random AMP sample (RAS) is collected at the same time and is stuffed into the histogram.  This RAS is saved for future use with extrapolation--extrapolation for any index or column within that table that has had statistics aleady collected.

When a query is being optimized, a new random AMP sample (RAS) is always collected for each table in the request.  Additionally, PI or the PARTITION histograms for a given table will be looked for, and read if they exist.  If both PI and PARTITON histograms exist, the most recent will be used to determine the staleness of whatever other index or column statistic is needed on the same table.  

The old RAS value number of rows (taken at the time either the PI or the PARTITION statistic was collected) will be compared to the current RAS value number of rows that the optimizer just collected, as a means of identifying table growth.   The delta between those two row counts will indicate how much growth that table has undergone since the PI or PARTITION stats were collected.  

If neither PI nor PARTITION statistics are available, then a row count estimate from the current RAS is used to estimate the current table row count.  This current RAS estimate is compared against that statistic’s histogram row count.  Comparing actual rows counts in an old histogram against a current RAS estimate is a less accurate indicator of table growth than you would get if PARTITION or PI stats existed.  When the old and current RAS comparison is made, the determination of table growth is drawn from row count estimates originating from the same single AMP.  Unless there is skewing in how the table grows (some AMPs have added a lot more rows, other AMPs none), this approach has proven to be more accurate, particularly when rows are distributed unevenly across all AMPs.

The algorithm that compares old RAS with new RAS (where PARTITION statistics play a role) provides reasonable results in most cases.  But there are a couple of things to keep in mind with table growth detection:

  1. If tables have a lot of variable data, or are using compression, or have uneven data growth across AMPs, the estimate based on old RAS vs. new RAS may not be accurate.
  2. Only upward growth is detected, so if PPI table has dropped a partition and then added a partition of a similar size since statistics were last collected, the table would appear to have not grown and no extrapolation will take place.  

Be aware that if you attempt to use a sampling (USING SAMPLE) when collecting statistics on PARTITION, it will be ignored.   Internally the sample percent will be automatically raised to 100%, so you will always get full statistics when you collect on PARTITION.   The same recommendation applies (collect only full statistics) for multicolumn statistics that include PARTITION.    

A greater level of detail on how statistics extrapolation works is available in the Orange Book titled “Optimizer Statistics Extrapolations” by Rama Korlapati, September 2010.

mahesh 1 comment Joined 06/04
18 Feb 2012

Thanks Carrie.

Quick question? How accurate is the row count calculation from cyclinder INDEX? Is it always gives the exact total row count in the table?

The reason I am asking this, instead of doing a COUNT(*) against very large tables in ETL's (usually it is very expensive due to the full table scans) , Is it safe to do a PARTITION stats and then read the row count from the stats histogram (stored in the statistics column)?

Since collecting stats on PARTITION column is fast and not an expensive operation, this is way less expensive than doing a full table scan against a large table, right? A lot of ETL operation perform post ETL count(*) against the tables to capture the row count for data validation purposes. So if this PARTITION stats and reading the row count from the histogram serves the same purpose, Can we recommend that?

Bank of America

carrie 595 comments Joined 04/08
20 Feb 2012

Hi Mahesh.

Good to hear from you!

Row count calculation based on cylinder index does give an accurate row count. In addition, many CIs may already be in memory because they are used so often to find data blocks.

In Teradata 13.0, the SELECT COUNT(*) optimization pulls row count data from the cylinder indexes, rather than scanning the table. Each CI has data block descriptors (DBD) that represent each data block, including the row count for that data block. The DBD row counts are summed for each AMP in parallel. The sums are sent to the control AMP for global aggregation. PARTITION stats work in a similar way.

If you are already on 13.0, then you are benefiting from this optimization for SELECT COUNT(*). If you are not yet, then it may well be faster to collect statistics on PARTITION to get the current row count, as you describe.

For those on Teradata 13.10, if you have a new system (sorry, not you Mahesh) you will experience even more efficiency when collecting PARTITION stats or doing a SELECT COUNT(*). This is because new systems in 13.10 will use cylinders that are six times as large and can hold six time as many data blocks. So you may only need to read 1/6th the number of cylinders to establish the table's row count.

Thanks, -Carrie

AbeK 37 comments Joined 08/09
26 Feb 2012

Carrie, thanks for this article. I've implemented a stats process based on this rowcount from parttion stats and it has proved to be quite valuable for us. Now if only we could firgure out an easier way to identify unused stats definitions, it would be a huge benefit.

m.tahoon 4 comments Joined 09/11
03 Feb 2015

Do collecting [partition] only stats on a daily basis for all tables will cause extrapolation process to never work ?
we have an after data loadnig daily morning job which collects [partition] columns stats for all tables, and using this method row count from RAS of the partition Histogram taken daily morning will always be very simillar to the current RAS.
IF my understanding is correct, we will need to only collect partition column stats for tables do not have stats, but we need to keep the values of the partition column histograms InSync with other stats so optimizer adapt stats with growth if needed  ??

carrie 595 comments Joined 04/08
04 Feb 2015

Collecting statistics on PARTITION only applies in Teradata releases BEFORE 14.0.  After 14.0 you do not want to collect on PARTITION for all tables, only for partitioned tables.   In releases 14.0 and above, collect SUMMARY statistics on all non-partitioned tables, not PARTITION statistics.
Concerning your first question  "Do collecting [partition] only stats on a daily basis for all tables will cause extrapolation process to never work ?"
The answer is no.  Collecting stats on PARTITION will help extrapolation be more accurate for all statistics on the table.  The first paragraph of the blog posting above makes that point:  "Extrapolations based on table growth will be be more accurate if statistics are being regularly collected on the system-derived column PARTITON, whether or not a table is partitioned."
You only need to collect on PARTITION for tables that DO HAVE statistics, not on tables that DO NOT HAVE statistics.   If a table does not have statistics, there will be no statistics to do the extrapolation process on.  The purpose of extrapolation is when you don't have time to fully recollect statistics, then the optimizer will assess the growth of the table since the last statistics collection and project what some of the values in the statistics histogram will be.   If there is no histogram, there will be no extrapolation.
Having PARTITION stats that are very current will help the extrapolation of statistics on that same table be more accurate.
Thanks, -Carrie

m.tahoon 4 comments Joined 09/11
06 Feb 2015

Thank you Carrie for your feedback.
But Based on the explaination in the blog i can't understand how extrapolation will still work with a current statistics for partition/summary.
Let me exlpain my confusion with more details please
Assume we have a products_id table with total of 1000 products distributed evenly between 2 types (500 for each type); tonight loading job added another 1000 product id's also distributed evenly between same product types.
The morning collect stats on partition/summary columns only will update those column with the new row count of 2000 which will match the RAS row count.
in this case the extrapolation will never work since there's no real deviation between row count info. recorded partition/summary column stats and the RAS row count value; however the stats of the [product_type] columns still has info about only 500 products on each type when  product_type column stats had been collected before that loading job.
Thank you

carrie 595 comments Joined 04/08
10 Feb 2015

Growth detection relies on both RAS comparisons and row count comparisons.  Extraploation only happens if growth in a table that the statistic belongs to (let's call that statistic p_name) is detected. 
Starting in 12.0.3 but before 14.0 (before summary stats were available), a table's growth is detected in this way:
1.  PARTITION stats on a table include a column for the RAS row count at the time the PARTITION stats are collected
2.  The RAS stored in the PARTITION histogram is compared against the row count from current RAS to see if PARTITION stats are stale.
3.  If PARTITION stats are determined to be stale, the PARTITION row counts are adjusted upwards (in memory) to account for table growth
4.  Then the adjusted PARTITION stats row counts are compared against the row count from the statistic histogram (p_name)
5.  If there is a discrepany in row counts between the p_name histogram and the PARTITION row counts, extrapolation takes place for the p_name statistic
Starting in 14.0 (before the UseCount option was availble), a  table's growth is detected in this way:
1.  The summary stats (which include the RAS at the time they were collected) for the table are accessed
2.  Summary stats RAS is compared against current RAS row count
3.  If summary stats are stale, summary stat row count numbers are adjusted upwards (in memory) based on the delta with RAS
4.  Adjusted summary stats row counts are compared against the row count from the statistic histogram (p_name)
5.  If there is a discrepany in row counts between the p_name histogram and summary rows counts, extrapolation takes place
Thanks, -Carrie

You must sign in to leave a comment.