All Forums Database
ricejfx 8 posts Joined 07/13
08 Jan 2014
Statistics Creation on Tables and Stat Subsets to Enable Faster Join Queries

Hello, 
Consider the following situation: 
Table Alpha has the following columns:
Col1
Col2
Col3
Col4
Col5
Primary Index (Col1, Col2, Col3, Col4)
 
Table Beta has the following columns:
Col1
Col2
Col3
Col6
Col7
Primary Index (Col1, Col2, Col3)
 
Table Gamma has the following columns:
Col1
Col2
Col8
Col9
Col10
Primary Index (Col1, Col2)
 
A user may want to join Alpha to Beta, Alpha to Gamma, or Beta to Gamma.  Generally speaking, does it suffice to only run the following statistics? 
Alpha - (Col1, Col2, Col3, Col4)
Beta - (Col1, Col2, Col3)
Gamma - (Col1, Col2)
 
Or would there be benefit to adding the following stats also?
Alpha - (Col1, Col2) // for join to Gamma
Alpha - (Col1, Col2, Col3) // for join to Beta
Beta - (Col1, Col2) // for join to Gamma
 
Posed a different way, can the Teradata Query Optimizer use subsets of statistics, or should there be specific statistics for specific potential join conditions?  Note that in our case, join indexes are not a viable solution, given our business conditions.  

M.Saeed Khurram 544 posts Joined 09/12
08 Jan 2014

Hi,
The best way to check the required stats is to use help stats diagnostics.
You can use it as follows:

DIAGNOSTIC HELPSTATS ON FOR SESSION;
Explian Your_Query

Then go to the end of explain plan and collect the recomended stats for the joins.
 
Secondly if your joins are based on Primary index then the stats collection on index columns will be enough to optimize joins.
 

Khurram

sm126877 2 posts Joined 10/09
02 Feb 2014

I would disagree that DIAGNOSTIC HELPSTATS is the "best" way because that will simply identify all stats variations that are pertinent to the query you are studying and not necessarily giving you a "minimum that you should collect" answer.
The answer to the original question should be, No, the Optimizer is not going to analyze subsets of statistics in the way that ricejfx was hoping because the scenario described only collects statistics on sets of columns.  If you instead collected stats on each individual column the Optimizer **will** consider permutations of the individual columns but at the potential risk of inaccuracy.
For example, if 4 different COLLECT STATISTICS statements were issued against table Alpha to collect on the individual columns Col1, Col2, Col3, and Col4 then permutations are considered but it is possible that signficant skewing associated with the key value ( Col1=9, Col2=9, Col3=9 ) **might** go unnoticed.  The Optimizer is often able to catch those situations with multi-column stats but not always so I would collect stats on the permutations if I really felt the risk is real.  In the end, "know your data" is the best way to decide.

Adeel Chaudhry 773 posts Joined 04/08
03 Feb 2014

DIAGNOSTIC HELPSTATS is just a guideline .... or a starting point for stats collection. It doesn't give you the complete need or end-to-end stats to be collected.
Usually, it tends to solve the issue for many cases, but again .... its not a complete solution for stats collection.

-- If you are stuck at something .... consider it an opportunity to think anew.

Adeel Chaudhry 773 posts Joined 04/08
03 Feb 2014

As for the original question .... what are other details like:
 
- Data-size
- Any other use of tables
- Is it joined only on those columns or can be joined for other columns as well
- Does Col1 in all tables have same data-type?
- Is there any casting involved? etc.
 
Collecting stats can have adverse action as well, so it is not recommended to collect stats on all (joining) columns. And you need to have a complete picture of how the table is going to be used.

-- If you are stuck at something .... consider it an opportunity to think anew.

You must sign in to leave a comment.