All Forums Database
m.tahoon 43 posts Joined 09/11
12 Jul 2012
Huge Cardinality estimates for TIMESTAMP COLUMNS

estimated cardinality for the following predicate towards timestamp column is far too high vs date column

1) Table DDL:

--------------------------------------

CREATE MULTISET TABLE tdb.subscription_offering_hist ,NO FALLBACK ,
xxxx    (
      Subscription_Id INTEGER,
      Offering_Id INTEGER,
      Subscription_Offering_Start_Dt TIMESTAMP(0),
......

     DWH_Entry_Date DATE FORMAT 'YY/MM/DD')
PRIMARY INDEX ( Subscription_Id );

 

2) predicate towards date column:

-----------------------------------------

SELECT count(Subscription_Id)
FROM devdb.soh
WHERE  CAST( '2012-05-31' AS DATE )  
BETWEEN DWH_Entry_Date
AND DWH_Entry_Date + 150
;

Actual rows returned : 162,480
Estimated rows returned: 161,085

 

 

3) predicate towards Timestamp column:

-----------------------------------------

SELECT Subscription_Id
FROM devdb.soh
WHERE CAST( CAST( '2012-05-31' AS DATE ) AS TIMESTAMP(0))
BETWEEN subscription_offering_start_dt
AND subscription_offering_start_dt  + ((TIME '23:59:59'  - TIME '00:00:00') HOUR TO SECOND)  

Actual rows returned : 1,724
Estimated rows returned: 96,960

 

 

What's strange that i've tried different values for the timestamp predicate and whichever the value use; optimizer always estimates 96,960 row; while it;s not the case if we run the predicate for date column; different date sure return different estimations.

 

i've collected needed stats; also this is a test table so it;s not a corrupted stats problem since i've dropped & recreated stats on original table and same issue presisted.

 

Regards

dnoeth 4628 posts Joined 11/04
12 Jul 2012

When you check Explain you should see different confidence levels:
HIGH for #2
NO for #3, this indicates that stats could not be used, so it defaults to 1/3 of the rows due to the BETWEEN

For #2 the optimizer rewrote your condition first to 
WHERE DWH_Entry_Date BETWEEN   
DATE '2012-05-31' - 150 AND DATE '2012-05-31'

and finally to
WHERE DWH_Entry_Date BETWEEN   
DATE '2012-01-02' - 150 AND DATE '2012-05-31'

But this kind of optimization is only done for INTEGERs and DATEs, not for TIMESTAMPs.

You have to rewrite it on your own to remove the calculation on the DWH_Entry_Date column for #3:

WHERE subscription_offering_start_dt
BETWEEN CAST(DATE '2012-05-31' AS TIMESTAMP(0))
AND CAST(DATE '2012-05-31' AS TIMESTAMP(0)) + INTERVAL '23:59:59' HOUR TO SECOND

or

WHERE subscription_offering_start_dt
BETWEEN TIMESTAMP '2012-05-31 00:00:00'
AND TIMESTAMP '2012-05-31 23:59:59'

This should result in HIGH confidence again.

Dieter

Dieter

m.tahoon 43 posts Joined 09/11
15 Jul 2012

Thank You Dieter for the important information:

However:

SELECT Subscription_Id
FROM devdb.soh
WHERE CAST( CAST( '2012-05-31' AS DATE ) AS TIMESTAMP(0)) + ((TIME '23:59:59'  - TIME '00:00:00') HOUR TO SECOND)  
BETWEEN subscription_offering_start_dt
AND subscription_offering_start_dt  + ((TIME '23:59:59'  - TIME '00:00:00') HOUR TO SECOND)  

EST: 76,574,794
ACT:  1,956,001

SELECT COUNT(Subscription_id) FROM  vdb.subscription_offering_hist c
WHERE c.subscription_offering_start_dt
BETWEEN TIMESTAMP '2012-05-31 00:00:00'
AND TIMESTAMP '2012-05-31 23:59:59';

EST: 74,069
ACT: 1,956,001

 

>> The size of Spool 1 is estimated with high confidence to be 74,069 rows !

 

dnoeth 4628 posts Joined 11/04
15 Jul 2012

What's your TD release?

The high confidence indicates stale stats, could you show the output of HELP STATS for vdb.subscription_offering_hist?

Dieter

Dieter

m.tahoon 43 posts Joined 09/11
15 Jul 2012

i thought high confidence means stats are good  current !

 

we are running 13.10

 

 Date Time              Unique Values     Column Names
------ ------                ------------------   --------------------

12/07/14 23:07:50           43,775,363 Subscription_Id
12/07/13 16:13:55           151            Offering_Id
3 12/07/13 16:14:02         9,396,526   Subscription_Offering_Start_Dt
4 12/07/13 16:14:03         1,255,879   Subscription_Offering_End_Dttm
5 12/07/15 09:11:36                    1 PARTITION
 

 

dnoeth 4628 posts Joined 11/04
15 Jul 2012

Sorry, i wasn't clear. Of course "high confidence" usually indicates that there are stats (and the optimizer trusts them).

But in this special case (EST: 74,069 vs. ACT: 1,956,001) i assumed the stats could be outdated, resulting in wrong estimates.

Your stats are current, so this under-estimation by the factor of 26 must have a different reason:

Is there an even distribution of values or are offerings sent on specific days within a month?

Are the stats sampled?

When you incease the range of dates is the estimate closer to the actual number

Dieter

Dieter

m.tahoon 43 posts Joined 09/11
15 Jul 2012

NO it;s not sampled.

with wider range ( the whole month)  the ratio changed from 1:10

 

SELECT Subscription_id FROM  tdb.subscription_offering_hist c
WHERE c.subscription_offering_start_dt
BETWEEN TIMESTAMP '2012-05-01 00:00:00'
AND TIMESTAMP '2012-05-31 23:59:59'
;

EST: 97,176,312
ACT: 11,057,406

 

i still think there another since we don't encounter such issue with date column.

dnoeth 4628 posts Joined 11/04
15 Jul 2012

This is strange,
- increasing the range from 1 day to 31 days: 31x
- actual rows: 6x (-> not evenly ditributed)
- estimated rows: 1300x (74,069 -> 11,057,406)

Might be a problem specific to Timestamps and/or your actual data, did you already inspect the stats using "HELP STATS tdb.subscription_offering_hist COLUMN subscription_offering_start_dt"?

Maybe you should ask TD support if this is a known problem.

Dieter

Dieter

You must sign in to leave a comment.