# Removing Multi-Column Statistics – A Process for Identification of Redundant Statistics

This article delves into the potential excessive use of Multi-Column (MC) Statistics and a process for identifying those that are not providing benefit and should be removed. This will reduce complexity and save resources on the re-collection of these redundant statistics.

Statistics and indexes are one (if not the most) discussed topics in the Teradata database tuning world. How many to add, of what type, recollection frequency, etc. In this article, I want to focus specifically on the potential for removal of MC statistics. You can view this article as a continuation of the multi-column statistics discussion started in Carrie’s blog on the dropping of Multi-Column Statistics, dated 20 July 2009. Here is the URL, and for those that have not read it, I would go read it first, as she does an excellent job of providing background, and then come back to this article: http://developer.teradata.com/blog/carrie/2009/07/should-you-drop-all-multicolumn-statistics-that-are-over-16-bytes

For a specific engagement, we found ourselves looking at a database where developers had “over-embraced” the addition of MC statistics on tables, to the point where in some cases the recollection of the statistics took an inordinate percentage of the overall resources. So our goal was to try to identify those MC statistics that were potentially not adding any benefit and could be removed.

Based on the working hypothesis that any MC combination whose first column, or combined length exceeded the 16 byte limit would probably not be giving additional benefit by the addition of other columns, we took at look at MC statistic combinations that also had statistics collected on the first column of the MC statistics combination.

In looking at the help stats command for the below table, you will see an interesting pattern in the three scenarios I am going to discuss. Notice, that for those MC statistic combinations starting with attribute_a or attribute_b the unique values are the same across all combinations, which was what we expected. However, this trend is not confirmed by the combinations starting with attribute_c, which was a surprise.

HELP STATISTICS TARGETDB.TARGETTABLE; Date Time Unique Values Column Names ======== ======== ==================== ============================================================================= 09/10/13 13:14:41 1,509,232 ATTRIBUTE_A 09/10/13 13:13:20 1,509,232 ATTRIBUTE_A,COLUMN_A 09/10/13 13:13:44 1,509,232 ATTRIBUTE_A,ATTRIBUTE_C,COLUMN_A,COLUMN_C,COLUMN_D,COLUMN_E,COLUMN_F,COLUMN_G 09/10/13 13:14:19 1,509,232 ATTRIBUTE_A,COLUMN_I 09/10/13 13:13:27 1,509,232 ATTRIBUTE_A,COLUMN_H 09/10/13 13:13:50 1,509,232 ATTRIBUTE_A,COLUMN_C 09/10/13 13:14:13 1,509,232 ATTRIBUTE_B 09/10/13 13:15:12 1,509,232 ATTRIBUTE_B,ATTRIBUTE_A 09/10/13 13:14:22 1,509,232 ATTRIBUTE_B,COLUMN_B 09/10/13 13:14:28 1,509,232 ATTRIBUTE_B,ATTRIBUTE_C 09/10/13 13:14:59 1,509,232 ATTRIBUTE_B,ATTRIBUTE_C,COLUMN_A 09/10/13 13:14:52 1,370 ATTRIBUTE_C 09/10/13 13:15:14 95,661 ATTRIBUTE_C,COLUMN_J 09/10/13 13:14:36 1,373 ATTRIBUTE_C,COLUMN_B 09/10/13 13:14:54 77,343 ATTRIBUTE_C,COLUMN_A 09/10/13 13:15:16 77,434 ATTRIBUTE_C,COLUMN_A,COLUMN_C 09/10/13 13:14:34 1,628 ATTRIBUTE_C,COLUMN_I 09/10/13 13:15:22 1,527 ATTRIBUTE_C,COLUMN_H 09/10/13 13:13:09 1,428 ATTRIBUTE_C,COLUMN_C

To confirm the accuracy of the Unique Values shown in the help statistics, we performed an analysis consisting of the following steps:

1) Ran an explain on a query that consists of a “select (distinct(columnlist)) from table” to see how close the explain estimates matched the unique values, and what confidence was given.

2) Ran the actual query, to compare the actual results to the explain estimates

3) We then removed the multi-column statistics, leaving only the statistics on the first column of the MC statistics group, and repeated steps 1 and 2

Here are the before removal results from steps 1 & 2. The columns are:

- HELP STATS UNIQUE VALUES (same as from the HELP STATS)
- EXPLAIN are the values coming from running the explain plan on the “select (distinct(columnlist)) from table” combinations
- ACTUAL are the numbers coming from running the actual query
- CONF - the explain confidence from the explain

As you can see the unique values are supported by the explains, which all had HIGH confidence and are also supported by the actual run of the queries.

Next, we removed the statistics from all of the MC Combinations. You will notice that for the first two scenarios, everything remained the same: the unique values are supported by the explains and actual runs, and had HIGH confidence.

However, for the third scenario, the explains all dropped to LOW confidence, and the explain numbers and actual run numbers deviated greatly from the first column values.

What do you think would cause scenario three to be different? Well, if you read Carrie’s Blog carefully, in the third paragraph she mentions differentiation in the first 16 bytes. This is key, because, it is not necessarily the length of the columns concatenated together that matters (in many of scenario #3’s column combinations the first two fields exceed 16 bytes) but rather the **length of the data content **of those columns concatenated together.

So, what we see happening in scenario #3 is that in all of the MC combination cases, enough of the data from the first two columns fits into the 16 bytes to greatly differentiate the “values”, and hence give us different demographics. This told us that our working hypothesis did not apply to all MC Statistic situations, and that you could remove MC Statistics where the MC unique values equaled the first columns unique values, but you should not necessarily remove MC Statistics where the unique values differed from the unique value of the first column, as that would lead to less confidence. This gave us a working rule set, depicted below:

And so, armed with this set of rules we developed a query that will look for this situation, and generate DROP STATISTICS statements for all MC Statistic combinations that fit the REMOVE criteria. The query uses a version of Dieter Noeth's Stats_Info view (if you do not already have this, I'm sure a web search will turn it up: try http://www.teradataforum.com/attach.htm, or ask your friendly on-site Teradata PS Consultant).

Here is the query, with sample results for this table. If you want to see the intermediate results, simple run the DT select part. Give it a run and see what turns up. Let me and the DevX community know what your results are, especially if you find any substantial savings. We all want to hear about your success.

SELECT 'DROP STATISTICS ON ' || TRIM(DATABASENAME) || '.' || TRIM(TABLENAME) || ' COLUMN (' || TRIM(ColumnList) || ');' REMOVE_MC_STATS FROM ( SELECT SINGLE.DatabaseName ,SINGLE.TableName ,SINGLE.ColumnName (CHAR(30)) ,SINGLE.NumOfValues (FORMAT 'ZZZ,ZZZ,ZZZ,ZZ9') FirstColumnValues ,CASE WHEN FirstColumnValues = MultiColumnValues THEN 'REMOVE' ELSE 'KEEP ' END Recomendation ,MULTI.NumOfValues (FORMAT 'ZZZ,ZZZ,ZZZ,ZZ9') MultiColumnValues ,MULTI.ColumnName ColumnList FROM ( SELECT DatabaseName ,TableName ,ColumnName (CHAR(30)) ,StatsType ,CollectDate ,SampleSize ,NumOfRows ,NumOfValues FROM TOOLSDB.Stats_Info WHERE statstype = 'C' AND NumOfValues > 0 ) SINGLE, ( SELECT DatabaseName ,TableName ,ColumnName (CHAR(120)) ,StatsType ,CollectDate ,SampleSize ,NumOfRows ,NumOfValues FROM TOOLSDB.Stats_Info WHERE statstype = 'M' AND NumOfValues > 0 AND COLUMNNAME <> 'PARTITION' ) MULTI WHERE SINGLE.DATABASENAME = MULTI.DATABASENAME AND SINGLE.TABLENAME = MULTI.TABLENAME AND SINGLE.COLUMNNAME = SUBSTR(MULTI.COLUMNNAME,1,POSITION(',' IN MULTI.COLUMNNAME) - 1) AND RECOMENDATION = 'REMOVE' AND SINGLE.DATABASENAME = 'TARGETDB' -- ORDER BY 1,2,3 ) DT ORDER BY 1 ; REMOVE_MC_STATS ============================================================================================================= DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,COLUMN_A); DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,ATTRIBUTE_C,COLUMN_A,COLUMN_C,COLUMN_D,COLUMN_E); DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,COLUMN_I); DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,COLUMN_H); DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_A,COLUMN_C); DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_B,ATTRIBUTE_A); DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_B,COLUMN_B); DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_B,ATTRIBUTE_C); DROP STATISTICS ON TARGETDB.TARGETTABLE COLUMN (ATTRIBUTE_B,ATTRIBUTE_C,COLUMN_A);

And now for extra credit (or more homework, however you want to look at it :) the following query looks for tables that have MC Statistic combinations, but do NOT have statistics on the first column of the MC Statistics. You may find that you can add the single column statistics on the first column, get good results and be able to remove the MC Statistics.

SELECT 'COLLECT STATISTICS ON ' || TRIM(DATABASENAME) || '.' || TRIM(TABLENAME) || ' COLUMN (' || TRIM(SUBSTR(ColumnList,1,POSITION(',' IN ColumnList) - 1)) || ');' ADD_SINGLE_STATS FROM ( SELECT MULTI.DatabaseName ,MULTI.TableName ,SINGLE.ColumnName (CHAR(30)) ,CASE WHEN MULTI.NumOfRows = MULTI.NumOfValues THEN 'ADD SINGLE COLUMN' ELSE ' ' END Recomendation ,MULTI.NumOfRows ,MULTI.NumOfValues (FORMAT 'ZZZ,ZZZ,ZZZ,ZZ9') MultiColumnValues ,MULTI.ColumnName ColumnList FROM ( SELECT DatabaseName ,TableName ,ColumnName (CHAR(120)) ,StatsType ,CollectDate ,SampleSize ,NumOfRows ,NumOfValues FROM TOOLSDB.Stats_Info WHERE statstype = 'M' AND NumOfValues > 0 AND COLUMNNAME NOT LIKE 'PARTITION%' AND POSITION(',' IN COLUMNNAME) > 0 ) MULTI LEFT OUTER JOIN ( SELECT DatabaseName ,TableName ,ColumnName (CHAR(30)) ,StatsType ,CollectDate ,SampleSize ,NumOfRows ,NumOfValues FROM TOOLSDB.Stats_Info WHERE statstype = 'C' AND NumOfValues > 0 ) SINGLE ON SINGLE.DATABASENAME = MULTI.DATABASENAME AND SINGLE.TABLENAME = MULTI.TABLENAME AND SINGLE.COLUMNNAME = SUBSTR(MULTI.COLUMNNAME,1,POSITION(',' IN MULTI.COLUMNNAME) - 1) WHERE MULTI.DATABASENAME = 'TARGETDB' AND SINGLE.ColumnName IS NULL AND Recomendation > ' ' -- ORDER BY 1,2,3 ) DT GROUP BY 1 ORDER BY 1 ;

Good Luck!

Dave

Hi Dave,

good article, this triggers some possible enhancements to my stats query :-)

Hmmm, where's that "FieldCount" column in HELP STATS COLUMN calculated from?

A new column comparing the number of columns in the create statement vs. the actual number of columns in the stored stats would be quite usefull.

Btw, it's "Dieter Noeth" instead of "Dieter North", the "oe" is actually a german "o umlaut".

The stats query can be found at the Teradata Forum in the attachement area:

http://www.teradataforum.com/attach.htm

Dieter

Dieter

Interesting approach. Thanks for posting, Dave.

I am guessing here, but I think that the first column in the multicolumn stats where the distinct values did not change were both unique columns (that would be attribute_A and attribute_B). When the first value in a multicolumn stat is unique, that statistic will report the same number of distinct values as just that single leading column does in its single-column stat, because you can't get more distinct than the one first column alone already is. I'm thinking that that is also why you get high confidence after dropping the multicolumn stats for the query that references all the columns in an aggregation. The optimizer sees the first column is unique in the aggregation, and therefore knows exactly how many rows the query will return. It doesn't need information about the other columns.

I tried this both with and without the leading column being unique, and got the same results as you did. A unique leading column gave me the same results as you got for attribute_A and attribute_B; non-unique gave me the same result as with attribute_C.

The 16-byte limit doesn't really come into play when number of distinct values are being calculated. The number of distinct values are calculated before data is placed in the detailed statistics intervals, and that calculation uses all columns in the multi-column stat. Only when the combined values are placed in the detailed intervals does the the truncation take place. The fields that carry the values in the detailed intervals are only 16 bytes long, so with several columns in your multicolumn stat, some truncation may take place there.

There is a different size limitation that comes into play during the calculation of number of distinct values, but that is applied to each CHAR, VARCHAR or BYTE column separately. Only the first 32 bytes of each such column will be included in that calculation, but all the columns will be represented.

Hi Dave ,

Useful information, wish I was here couple of days ago.

I came across similar scenario where MC stats was negatively impacting the query and caused spool space errors. By checking the explain plan, I have dropped MC stats and got the query working.

For future analysis, I will leverage your code and run it on my tables and share the feedback.