nyemul 14 posts Joined 09/12
05 Nov 2012
Slow Left Join query

The following query takes 14 minutes. Table OUT_JOIN1 contains 200,001 rows. Table spa_usage1 contains approximately 54 million rows.

Stats are collected for both tables.

 

Will appreciate it if you can share your thoughts on how the performance of this query can be improved.

 

   select
      out1.S_SK,
      out1.C_SK,
      out1.S_ID,
      out1.C_ID,
      usg.P_SK,
      usg.SE_SK,
      usg.E_SK,
      usg.USG_DT_SK,
      out1.S_DT
   from
      MISC.OUT_JOIN1 out1 left join
      MISC.spa_usage1 usg
         on
         (
            out1.S_SK = usg.S_SK and
            usg.USG_DT_SK > 17901
         )
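One note on the query shape: because this is a left join, the usg.USG_DT_SK filter has to stay in the ON clause (or in a derived table); moving it to a WHERE clause would silently turn the outer join into an inner join. A semantically equivalent derived-table form is sketched below; the optimizer already pushes this predicate down (see the EXPLAIN that follows), so this is a readability rewrite rather than a guaranteed plan change:

   select
      out1.S_SK, out1.C_SK, out1.S_ID, out1.C_ID,
      usg.P_SK, usg.SE_SK, usg.E_SK, usg.USG_DT_SK,
      out1.S_DT
   from MISC.OUT_JOIN1 out1 left join
      (
         -- restrict the big table before the join; keeping this filter
         -- out of an outer WHERE clause preserves left-join semantics
         select S_SK, P_SK, SE_SK, E_SK, USG_DT_SK
         from MISC.spa_usage1
         where USG_DT_SK > 17901
      ) usg
      on out1.S_SK = usg.S_SK;

The relevant steps of the EXPLAIN for the original query: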

 

  4) We do an all-AMPs RETRIEVE step from 270 partitions of MISC.usg
     with a condition of ("MISC.usg.USG_DT_SK >= 17902.") into Spool 2
     (all_amps), which is redistributed by the hash code of (
     MISC.usg.S_SK) to all AMPs.  Then we do a SORT to order
     Spool 2 by row hash.  The result spool file will not be cached in
     memory.  The size of Spool 2 is estimated with low confidence to
     be 567,176,439 rows (18,716,822,487 bytes).  The estimated time
     for this step is 3 hours and 10 minutes.
  5) We do an all-AMPs JOIN step from MISC.out1 by way of a RowHash
     match scan with no residual conditions, which is joined to Spool 2
     (Last Use) by way of a RowHash match scan.  MISC.out1 and Spool 2
     are left outer joined using a merge join, with condition(s) used
     for non-matching on left table ("NOT (MISC.out1.S_SK IS NULL)"),
     with a join condition of ("MISC.out1.S_SK = S_SK").
     The result goes into Spool 1 (all_amps), which is built locally on
     the AMPs.  The result spool file will not be cached in memory.
     The size of Spool 1 is estimated with low confidence to be
     567,276,441 rows (49,920,326,808 bytes).  The estimated time for
     this step is 1 hour and 31 minutes.
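Both spool estimates above are marked "low confidence", which usually means the optimizer is missing statistics it would like to have. One way to check is Teradata's DIAGNOSTIC HELPSTATS session setting; any subsequent EXPLAIN then ends with a list of recommended COLLECT STATISTICS statements (sketch below, reusing the query above):

   -- ask the optimizer to report the statistics it wants
   DIAGNOSTIC HELPSTATS ON FOR SESSION;

   EXPLAIN
   select out1.S_SK, usg.USG_DT_SK
   from MISC.OUT_JOIN1 out1 left join MISC.spa_usage1 usg
      on out1.S_SK = usg.S_SK and usg.USG_DT_SK > 17901;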


CREATE TABLE MISC.spa_usage1 (
      S_SK DECIMAL(8,0) NOT NULL,
      P_SK DECIMAL(8,0) NOT NULL,
      SE_SK DECIMAL(8,0) NOT NULL,
      E_SK DECIMAL(8,0) NOT NULL,
      USG_DT_SK DECIMAL(8,0) NOT NULL,
      ...
      )
PRIMARY INDEX (S_SK, C_SK)
PARTITION BY RANGE_N(CAST(USG_DT_SK AS INTEGER) BETWEEN 17898 AND 18170 EACH 1, NO RANGE);
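Given the "low confidence" estimates in the EXPLAIN, it may be worth re-collecting the stats this query depends on; a minimal sketch, using standard Teradata syntax:

   -- PARTITION stats describe how rows spread across the PPI partitions
   COLLECT STATISTICS ON MISC.spa_usage1 COLUMN (PARTITION);
   -- the join column and the partitioning/filter column
   COLLECT STATISTICS ON MISC.spa_usage1 COLUMN (S_SK);
   COLLECT STATISTICS ON MISC.spa_usage1 COLUMN (USG_DT_SK);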


CREATE TABLE MISC.OUT_JOIN1 (
      S_SK DECIMAL(8,0),
      C_SK DECIMAL(8,0),
      C_ID VARCHAR(32),
      S_ID VARCHAR(32),
      S_DT TIMESTAMP(0))
PRIMARY INDEX (S_SK);

 

Thanks,
Niteen

ulrich 816 posts Joined 09/09
05 Nov 2012

Some observations:
1. "spa_usage1 contains approximately 54 million rows" vs. "The size of Spool 2 is estimated with low confidence to be 567,176,439 rows", which is a 10x difference. Are the stats not up to date, or are the expectations wrong?
2. The PPI "PARTITION BY RANGE_N(CAST((USG_DT_SK ) AS INTEGER) BETWEEN 17898 AND 18170 EACH 1, NO RANGE)" does not support at least this query well (see the worked numbers below).
3. The PI (S_SK, C_SK) indicates multiple rows per S_SK. Do you have highly / extremely skewed values?
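To put numbers on point 2: RANGE_N ... BETWEEN 17898 AND 18170 EACH 1 defines 273 one-day partitions plus NO RANGE. The predicate USG_DT_SK > 17901 eliminates only the first four days (17898 to 17901), leaving 269 day partitions plus NO RANGE, i.e. exactly the 270 partitions the EXPLAIN scans. Partition elimination prunes almost nothing for this query.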
 


nyemul 14 posts Joined 09/12
05 Nov 2012

Hi Ulrich,
select count(*) from misc.SPA_USAGE1 returns 575,641,759 rows.

DataBaseName  TableName    skew_factor
------------  -----------  -----------
misc          spa_usage1          1.71

My apologies for providing incomplete/incorrect information about the number of rows.

 

What can be done to improve the partitioning on date?
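One way to see how much the date partitioning can actually prune for a given predicate is to count rows per partition (PARTITION is a built-in column on PPI tables):

   -- rows per partition; partitions that USG_DT_SK > 17901 cannot
   -- eliminate will still be scanned in full
   select PARTITION, count(*)
   from MISC.spa_usage1
   group by 1
   order by 1;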

 

Niteen

ulrich 816 posts Joined 09/09
05 Nov 2012

The skew factor of the table is irrelevant here, as it reflects the (S_SK, C_SK) combination.
Try:

select top 100 hashrow(S_SK), count(*)
from spa_usage1
group by 1 
order by 2 desc

PPI - for this specific table you might be better off without the PPI; one way to test this is sketched below.
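A minimal sketch of such a test, building an unpartitioned copy to compare EXPLAIN plans. The _noppi name is illustrative, and the subquery form is used deliberately, since CREATE TABLE new AS old WITH DATA would copy the PARTITION BY clause too:

   -- unpartitioned copy of the table, same primary index
   CREATE TABLE MISC.spa_usage1_noppi AS
      (select * from MISC.spa_usage1)
   WITH DATA
   PRIMARY INDEX (S_SK, C_SK);

   COLLECT STATISTICS ON MISC.spa_usage1_noppi COLUMN (S_SK);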


nyemul 14 posts Joined 09/12
05 Nov 2012

Hello Ulrich,
All rows have the same count value.
 

 

HASHROW(S_SK)           Count(*)
-------------------  -----------
6A330209                   14144
9F8501AE                   14144
....
C8E817F5                   14144
CF0C31DB                   14144
92BE4FF9                   14144

 

Since the values are this equally distributed, it appears there is no chance to improve on this front.
I have learned this method of checking the distribution of S_SK from you.

 

Niteen

 

ulrich 816 posts Joined 09/09
05 Nov 2012

But in this case it means that for each matching S_SK your answer set will generate up to 14,144 result rows.
Is that what you expect?
Ulrich
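Rough arithmetic on the expected answer set: Spool 1 is estimated at 567,276,441 rows, which is close to the full row count of spa_usage1 (575,641,759), so nearly every usage row surviving the date filter seems to find a matching S_SK in OUT_JOIN1. If the result really is roughly 567 million rows (about 50 GB of spool), most of the 14 minutes is simply the cost of materializing that answer set, and no index or partitioning change will shrink it; only a tighter join condition or an aggregation would.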


ratnamch 14 posts Joined 06/12
05 Nov 2012

 
sel day_of_year from sys_calendar.calendar where calendar_date=date

This displays the day of the year for the current date. How do I write the query so that it returns the value for a previous date (e.g. 2011/05/03)?
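A minimal sketch of what seems to be intended, assuming "previous date" means the prior calendar day (the fixed date literal is illustrative only):

   -- day of year for yesterday, using Teradata date arithmetic
   sel day_of_year from sys_calendar.calendar where calendar_date = date - 1;

   -- or for a specific date
   sel day_of_year from sys_calendar.calendar where calendar_date = DATE '2011-05-03';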
