This article delves into the potential excessive use of Multi-Column (MC) Statistics and a process for identifying those that are not providing benefit and should be removed. This will reduce complexity and save resources on the re-collection of these redundant statistics.

Statistics and indexes are one (if not the most) discussed topics in the Teradata database tuning world. How many to add, of what type, recollection frequency, etc. In this article, I want to focus specifically on the potential for removal of MC statistics. You can view this article as a continuation of the multi-column statistics discussion started in Carrie’s blog on the dropping of Multi-Column Statistics, dated 20 July 2009. Here is the URL, and for those that have not read it, I would go read it first, as she does an excellent job of providing background, and then come back to this article: http://developer.teradata.com/blog/carrie/2009/07/should-you-drop-all-multicolumn-statistics-that-are-over-16-bytes

For a specific engagement, we found ourselves looking at a database where developers had “over-embraced” the addition of MC statistics on tables, to the point where in some cases the recollection of the statistics took an inordinate percentage of the overall resources. So our goal was to try to identify those MC statistics that were potentially not adding any benefit and could be removed.

Based on the working hypothesis that any MC combination whose first column, or combined length exceeded the 16 byte limit would probably not be giving additional benefit by the addition of other columns, we took at look at MC statistic combinations that also had statistics collected on the first column of the MC statistics combination.

In looking at the help stats command for the below table, you will see an interesting pattern in the three scenarios I am going to discuss. Notice, that for those MC statistic combinations starting with attribute_a or attribute_b the unique values are the same across all combinations, which was what we expected. However, this trend is not confirmed by the combinations starting with attribute_c, which was a surprise. 

HELP STATISTICS TARGETDB.TARGETTABLE;

Date       Time       Unique Values          Column Names  
========   ========   ====================   =============================================================================

09/10/13   13:14:41              1,509,232   ATTRIBUTE_A
09/10/13   13:13:20              1,509,232   ATTRIBUTE_A,COLUMN_A
09/10/13   13:13:44              1,509,232   ATTRIBUTE_A,ATTRIBUTE_C,COLUMN_A,COLUMN_C,COLUMN_D,COLUMN_E,COLUMN_F,COLUMN_G
09/10/13   13:14:19              1,509,232   ATTRIBUTE_A,COLUMN_I
09/10/13   13:13:27              1,509,232   ATTRIBUTE_A,COLUMN_H
09/10/13   13:13:50              1,509,232   ATTRIBUTE_A,COLUMN_C

09/10/13   13:14:13              1,509,232   ATTRIBUTE_B
09/10/13   13:15:12              1,509,232   ATTRIBUTE_B,ATTRIBUTE_A
09/10/13   13:14:22              1,509,232   ATTRIBUTE_B,COLUMN_B
09/10/13   13:14:28              1,509,232   ATTRIBUTE_B,ATTRIBUTE_C
09/10/13   13:14:59              1,509,232   ATTRIBUTE_B,ATTRIBUTE_C,COLUMN_A

09/10/13   13:14:52                  1,370   ATTRIBUTE_C
09/10/13   13:15:14                 95,661   ATTRIBUTE_C,COLUMN_J
09/10/13   13:14:36                  1,373   ATTRIBUTE_C,COLUMN_B
09/10/13   13:14:54                 77,343   ATTRIBUTE_C,COLUMN_A
09/10/13   13:15:16                 77,434   ATTRIBUTE_C,COLUMN_A,COLUMN_C
09/10/13   13:14:34                  1,628   ATTRIBUTE_C,COLUMN_I
09/10/13   13:15:22                  1,527   ATTRIBUTE_C,COLUMN_H
09/10/13   13:13:09                  1,428   ATTRIBUTE_C,COLUMN_C

To confirm the accuracy of the Unique Values shown in the help statistics, we performed an analysis consisting of the following steps:

1) Ran an explain on a query that consists of a “select (distinct(columnlist)) from table” to see how close the explain estimates matched the unique values, and what confidence was given.

 2) Ran the actual query, to compare the actual results to the explain estimates

3) We then removed the multi-column statistics, leaving only the statistics on the first column of the MC statistics group, and repeated steps 1 and 2

Here are the before removal results from steps 1 & 2. The columns are:

  • HELP STATS UNIQUE VALUES (same as from the HELP STATS)
  • EXPLAIN are the values coming from running the explain plan on the “select (distinct(columnlist)) from table” combinations
  • ACTUAL are the numbers coming from running the actual query
  • CONF - the explain confidence from the explain

As you can see the unique values are supported by the explains, which all had HIGH confidence and are also supported by the actual run of the queries.

Next, we removed the statistics from all of the MC Combinations. You will notice that for the first two scenarios, everything remained the same: the unique values are supported by the explains and actual runs, and had HIGH confidence.

However, for the third scenario, the explains all dropped to LOW confidence, and the explain numbers and actual run numbers deviated greatly from the first column values.

 

What do you think would cause scenario three to be different? Well, if you read Carrie’s Blog carefully, in the third paragraph she mentions differentiation in the first 16 bytes. This is key, because, it is not necessarily the length of the columns concatenated together that matters (in many of scenario #3’s column combinations the first two fields exceed 16 bytes) but rather the length of the data content of those columns concatenated together.

So, what we see happening in scenario #3 is that in all of the MC combination cases, enough of the data from the first two columns fits into the 16 bytes to greatly differentiate the “values”, and hence give us different demographics. This told us that our working hypothesis did not apply to all MC Statistic situations, and that you could remove MC Statistics where the MC unique values equaled the first columns unique values, but you should not necessarily remove MC Statistics where the unique values differed from the unique value of the first column, as that would lead to less confidence. This gave us a working rule set, depicted below:

 

And so, armed with this set of rules we developed a query that will look for this situation, and generate DROP STATISTICS statements for all MC Statistic combinations that fit the REMOVE criteria.  The query uses a version of Dieter Noeth's Stats_Info view (if you do not already have this, I'm sure a web search will turn it up: try http://www.teradataforum.com/attach.htm, or ask your friendly on-site Teradata PS Consultant).

Here is the query, with sample results for this table. If you want to see the intermediate results, simple run the DT select part. Give it a run and see what turns up. Let me and the DevX community know what your results are, especially if you find any substantial savings. We all want to hear about your success. 

SELECT 'DROP STATISTICS ON ' || TRIM(DATABASENAME) || '.' || TRIM(TABLENAME) ||
       ' COLUMN (' || TRIM(ColumnList) || ');' REMOVE_MC_STATS
FROM
(

SELECT SINGLE.DatabaseName
      ,SINGLE.TableName
      ,SINGLE.ColumnName (CHAR(30))
      ,SINGLE.NumOfValues (FORMAT 'ZZZ,ZZZ,ZZZ,ZZ9') FirstColumnValues

      ,CASE WHEN FirstColumnValues = MultiColumnValues
            THEN 'REMOVE' ELSE 'KEEP  ' END Recomendation

      ,MULTI.NumOfValues  (FORMAT 'ZZZ,ZZZ,ZZZ,ZZ9')  MultiColumnValues
      ,MULTI.ColumnName ColumnList

FROM
(
SELECT DatabaseName
      ,TableName
      ,ColumnName (CHAR(30))
      ,StatsType
      ,CollectDate
      ,SampleSize
      ,NumOfRows
      ,NumOfValues

 FROM TOOLSDB.Stats_Info

WHERE statstype = 'C'
  AND NumOfValues > 0
) SINGLE,

(
SELECT DatabaseName
      ,TableName
      ,ColumnName (CHAR(120))
      ,StatsType
      ,CollectDate
      ,SampleSize
      ,NumOfRows
      ,NumOfValues

 FROM TOOLSDB.Stats_Info

WHERE statstype = 'M'
  AND NumOfValues > 0
  AND COLUMNNAME <> 'PARTITION'
) MULTI

WHERE SINGLE.DATABASENAME = MULTI.DATABASENAME
  AND SINGLE.TABLENAME    = MULTI.TABLENAME
  AND SINGLE.COLUMNNAME   = 
      SUBSTR(MULTI.COLUMNNAME,1,POSITION(',' IN MULTI.COLUMNNAME) - 1)

  AND RECOMENDATION = 'REMOVE'

  AND SINGLE.DATABASENAME = 'TARGETDB'
  
-- ORDER BY 1,2,3

) DT

ORDER BY 1
;

REMOVE_MC_STATS                                                                                           
=============================================================================================================

DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,COLUMN_A);                                     
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,ATTRIBUTE_C,COLUMN_A,COLUMN_C,COLUMN_D,COLUMN_E);
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,COLUMN_I);                                 
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,COLUMN_H);                             
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,COLUMN_C);                         

DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_B,ATTRIBUTE_A); 
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_B,COLUMN_B);             
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_B,ATTRIBUTE_C);                      
DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_B,ATTRIBUTE_C,COLUMN_A);            

And now for extra credit (or more homework, however you want to look at it :) the following query looks for tables that have MC Statistic combinations, but do NOT have statistics on the first column of the MC Statistics. You may find that you can add the single column statistics on the first column, get good results and be able to remove the MC Statistics. 

SELECT 'COLLECT STATISTICS ON ' || TRIM(DATABASENAME) || '.' || TRIM(TABLENAME) ||
       ' COLUMN (' || TRIM(SUBSTR(ColumnList,1,POSITION(',' IN ColumnList) - 1)) || ');' ADD_SINGLE_STATS
  FROM
(
SELECT MULTI.DatabaseName
      ,MULTI.TableName
      ,SINGLE.ColumnName (CHAR(30))
      ,CASE WHEN MULTI.NumOfRows = MULTI.NumOfValues
            THEN 'ADD SINGLE COLUMN' ELSE ' ' END Recomendation
      ,MULTI.NumOfRows
      ,MULTI.NumOfValues (FORMAT 'ZZZ,ZZZ,ZZZ,ZZ9') MultiColumnValues
      ,MULTI.ColumnName ColumnList
  FROM
(
SELECT DatabaseName
      ,TableName
      ,ColumnName (CHAR(120))
      ,StatsType
      ,CollectDate
      ,SampleSize
      ,NumOfRows
      ,NumOfValues

  FROM TOOLSDB.Stats_Info

 WHERE statstype = 'M'
   AND NumOfValues > 0
   AND COLUMNNAME NOT LIKE 'PARTITION%'
   AND POSITION(',' IN COLUMNNAME) > 0
) MULTI

LEFT OUTER JOIN
(
SELECT DatabaseName
      ,TableName
      ,ColumnName (CHAR(30))
      ,StatsType
      ,CollectDate
      ,SampleSize
      ,NumOfRows
      ,NumOfValues

  FROM TOOLSDB.Stats_Info

 WHERE statstype = 'C'
  AND NumOfValues > 0
) SINGLE

 ON SINGLE.DATABASENAME = MULTI.DATABASENAME
AND SINGLE.TABLENAME = MULTI.TABLENAME
AND SINGLE.COLUMNNAME = 
    SUBSTR(MULTI.COLUMNNAME,1,POSITION(',' IN MULTI.COLUMNNAME) - 1)

WHERE MULTI.DATABASENAME = 'TARGETDB' 

AND SINGLE.ColumnName IS NULL
AND Recomendation > ' '

-- ORDER BY 1,2,3
) DT
GROUP BY 1
ORDER BY 1
;

 

Good Luck!

Dave

 

Discussion
dnoeth 86 comments Joined 11/04
22 Nov 2009

Hi Dave,
good article, this triggers some possible enhancements to my stats query :-)

Hmmm, where's that "FieldCount" column in HELP STATS COLUMN calculated from?
A new column comparing the number of columns in the create statement vs. the actual number of columns in the stored stats would be quite usefull.

Btw, it's "Dieter Noeth" instead of "Dieter North", the "oe" is actually a german "o umlaut".
The stats query can be found at the Teradata Forum in the attachement area:
http://www.teradataforum.com/attach.htm

Dieter

Dieter

carrie 595 comments Joined 04/08
23 Nov 2009

Interesting approach. Thanks for posting, Dave.

I am guessing here, but I think that the first column in the multicolumn stats where the distinct values did not change were both unique columns (that would be attribute_A and attribute_B). When the first value in a multicolumn stat is unique, that statistic will report the same number of distinct values as just that single leading column does in its single-column stat, because you can't get more distinct than the one first column alone already is. I'm thinking that that is also why you get high confidence after dropping the multicolumn stats for the query that references all the columns in an aggregation. The optimizer sees the first column is unique in the aggregation, and therefore knows exactly how many rows the query will return. It doesn't need information about the other columns.

I tried this both with and without the leading column being unique, and got the same results as you did. A unique leading column gave me the same results as you got for attribute_A and attribute_B; non-unique gave me the same result as with attribute_C.

The 16-byte limit doesn't really come into play when number of distinct values are being calculated. The number of distinct values are calculated before data is placed in the detailed statistics intervals, and that calculation uses all columns in the multi-column stat. Only when the combined values are placed in the detailed intervals does the the truncation take place. The fields that carry the values in the detailed intervals are only 16 bytes long, so with several columns in your multicolumn stat, some truncation may take place there.

There is a different size limitation that comes into play during the calculation of number of distinct values, but that is applied to each CHAR, VARCHAR or BYTE column separately. Only the first 32 bytes of each such column will be included in that calculation, but all the columns will be represented.

Terasuda 3 comments Joined 12/09
02 Dec 2009

Hi Dave ,
Useful information, wish I was here couple of days ago.
I came across similar scenario where MC stats was negatively impacting the query and caused spool space errors. By checking the explain plan, I have dropped MC stats and got the query working.
For future analysis, I will leverage your code and run it on my tables and share the feedback.

David.Roth 17 comments Joined 05/09
03 Dec 2009

Glad you like it. And I sympathize; don't know how many times I have done sometihing, only to find out someone else had a script or process :)

Karam 25 comments Joined 07/09
01 Feb 2010

Very useful information on Stats. Does that mean that if a PI/UPI is having X unique value and PI/UPI + MVC is also having identical value of X , then MVC can be taken off ?
Irrespective of 16 byte logic?

David.Roth 17 comments Joined 05/09
01 Feb 2010

First, we are not really looking at PI's here, just the first field of a Multi Column statistics combination. That given, yes, in a nutshell we are saying that if the values for the first field are fairly unique, then you most probably will not get much gain from multi column combinations. Remember to always test before and after.

dtotten 2 comments Joined 12/07
11 Feb 2010

Dave - appreciate the content - This is exactly what I needed recently

Raghu J 1 comment Joined 08/07
05 Mar 2010

HI, this stuff is really cool. we have lot of multi columns stats with more than 16 bytes and i started this exercise.But got aproblem. We have colelcted stats on four columns (col1,col2,col3,col4) and col1 separately.

Here is the info

Help stats on col1: 3,684,013 unique values
Help stats on col1,col2,col3,col4 : 205,474,830
Actual count of distinct col1,col2,col3,col4 :205,474,830

I did a explain plan for select distinct col1,col2,col3,col4 from table. It is showing estimated row count as 3,684,013 (col1) instead of 205,474,830. Is this a strange behavior or can we correct it? Your help is greatly appreciated and we can reduce lot of CPU and IO. thanks in advance

asd

David.Roth 17 comments Joined 05/09
06 Mar 2010

So remember, the premise here is that if the help stats number is the same for the first column (col1) as it is for the combination of columns(col1-col4) then that is a "candidate" for removal. In this case that is not true as the single versus multi-column help stats are different.

Are you saying that you ran the script provided and it recommended removing this multi-column stat?

Tariq_Saeed 1 comment Joined 07/09
27 May 2010

Dave or Carrie,

Question -> What if we take this concept a step further and - instead of restricting ourselves to only the "SC STATS = MC STATS" case - eliminate all MC stats which are within a few percent of the First Column SC stats? Would that be a practical approach? If yes, what would be a moderately safe cutoff percentage (2% or closer to 10%)? I ask this since at our customer site there are MC stats which take over 200K CPU secs to collect and the values are within 1% of the First Col SC STATS. I know the best test would be to actually drop such stats and watch the explains, but before we do that it would be good to have your opinions.

BTW Dave, thanks for the post, this saved us a ton of work and as you would have guessed we are planning to implement this at our client.

- Tariq

David.Roth 17 comments Joined 05/09
27 May 2010

This is something that I had considered, as a potential second round of identification, using a comparision of multi-column stats values within x% of the single column stats.

I believe the idea has merit, but have not actually tested it out, so I have no recommended percentage to give you, i.e. works good at 97% or above.

However, I have to think that within a couple percent or values plans should be fairly similiar. As you indicated, you would need to run a test and check the explains.

jainayush007 8 comments Joined 03/11
04 Jan 2013

Hi Dieter,

I am unable to find Stats_Info view on the below link http://www.teradataforum.com/attach.htm

:- can you please help me on this?

dnoeth 86 comments Joined 11/04
04 Jan 2013

There's an older version on my stats query on the TeradataForum.
You'll find the latest version on DevEx:
http://developer.teradata.com/node/9598

Dieter

Dieter

Smarak0604 15 comments Joined 08/12
12 Apr 2013

Hello David & Carrie,
Thanks for the Article & Discussions.
I would like to wind up the whole 16-Byte Limit thing by a few questions/doubts:
(a) You mentioned that the Data Content's Length matter, not the Column Length. Meaning, say Stats collected on (Col_1,Col_2) with Col_1 is Varchar(30) & Col_2 is Integer. So, What Matters is whether the Data in Col_1 exceeds 16 Byte or not, rather than the Column Length[Which is already more than 16 Bytes @ Varchar(30)].
In which case, ('ABCDEFHGIJKLMNOPQ',14) & ('ABCD',14) are Different.
(b) Stats(Col_1) and Stats(Col_1,Col_2) exists, with Col_1 nearly Unique[>= 95%]. You recommend Dropping Stats(Col_1,Col_2) based on Explain Capture on "Select Distinct(Col_1,Col_2) From Tablename", which is a Single-Table Operation.
As per Carrie, if these 02 Columns are used in Joining Conditions or Aggregate Operations, these would be required for Unique Values Calculation, which in Independent of Truncation.
So, Is the Approach of Dropping Stats based on the above Experiment alone [Single Table Experiment] Justified ? Cause, if both Carrie and you are Correct, Dropping a MC Stats would require your Approach PLUS digging around SQLs to verify that those Columns are not used in Joining or Aggregation Operations, which is an extremely difficult job.
 

Smarak0604 15 comments Joined 08/12
12 Apr 2013

Another Thing Also.....
 
Assuming that Stats(Col_1,Col_2,Col_3) exists and all these 03 Columns are not used in Joining or Aggregate Operations, then the Dropping MC Stats' approach would be:
(1) If first 16 Bytes same for (Col_1,Col_2,Col_3), Drop the Stats.
(2) If Unique(Col_1) >= say,95%, Drop Stats(Col_1,Col_2,Col_3) [Collect on Col_1 instead].
 
I understand, such operations should be performed based upon Testing, I am looking at a General Approach.

Sourabh 1 comment Joined 07/10
03 Dec 2013

Hi,
I am unable to access below link, please help.
http://developer.teradata.com/blog/carrie/2009/07/should-you-drop-all-multicolumn-statistics-that-are-over-16-bytes
Thank you.

David.Roth 17 comments Joined 05/09
03 Dec 2013

It looks like the link has been unpublished. this is probably due to the fact that in the latest releases of Teradata the multi-column statistics work slightly different now.

Rahul 3 comments Joined 05/09
05 Mar 2014

Hi,
Related to multi column statistics, one observation is that if any of the columns in a multi column stat has NULL value for all rows in the table, then the entire multi-column stat is considered NULL. That is, the non-NULL values of other columns in the multi-column stat are ignored. Is this by design ? If so, can someone explain why please ?
Thanks.
- Rahul.

13 Mar 2014

Thanks David for this, It really helped me to drop the unwanted multi column stats.
I can see that we completely avoided the PIs and Partition columns here. But what if single column stats is a PI stats, and mutli column is of PI column + some other column. In that case, if the unique values matches, we can still drop the mulit column stats(?).
Instead of keeping filter on 'Col' stats, can we have all single column stats (may be we can keep filter on COLUMNCOUNT)

David.Roth 17 comments Joined 05/09
22 Mar 2014

Sivakumar,
That would makes sense to me if the additional non-PI column did not make any demographic changes, like all rows had the same status value and multi-column stats were on PI+Status.
I excluded the PI ones because I would not expect to see that situation often, for this particular reason.
Sometimes you will see PI+Partition, or PI+Partitioning columns. These do provide some specific information to the optimizer in some Teradata versions
Dave
 
 
 

David.Roth 17 comments Joined 05/09
22 Mar 2014

Rahul,
I don not personally know the reason for this, but this is the way things work. For instance if you concatenate a few fields and one of them is NULL the result is also NULL.
This seems consistent with the multi-column stats rules
Dave

Jinesh 2 comments Joined 05/07
04 Nov 2014

Hi Dave, does this work on the Dieter's TD14 StatsInfo query?

You must sign in to leave a comment.