Boopathi15 11 posts Joined 12/13
07 Oct 2014
Need help in Optimizing Row_Number/RANK() operations

Hello

 

I have a source of transactional data. For a particular primary key I need to compare the current row with the previous row to see whether the data received in the current row is an exact duplicate. If so, I can drop the current row (intra-SCD).

 

To do this I'm using an ordered analytical function (ROW_NUMBER()). The table is huge, around 50 million rows, so the ROW_NUMBER step takes a very long time and a lot of CPU. The explain plan shows it always does a full table scan and then redistributes the rows. How can I improve this?

 

Table structure :

 

CREATE TABLE SOURCE_DATA
(
  KEY1 VARCHAR(50),
  KEY2 VARCHAR(50),
  KEY3 VARCHAR(50),
  Data_Value VARCHAR(500),
  Transaction_dttm TIMESTAMP(6)
)
PRIMARY INDEX (KEY1, KEY2, KEY3);

 

Row_Number syntax:

 

SELECT
   a.*
  ,ROW_NUMBER() OVER (PARTITION BY KEY1, KEY2, KEY3 ORDER BY Transaction_dttm ASC) AS rn
FROM SOURCE_DATA a;

 

Please suggest...


Raja_KT 1246 posts Joined 07/09
07 Oct 2014

You can try something like this:
select ....., your_field,
       max(your_field) over (order by your_field rows between 1 preceding and 1 preceding) m
qualify m < your_field  -- <for test>
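Adapted to your SOURCE_DATA table it could look roughly like this (just an untested sketch; the strings '<null>' and '<no prev>' are placeholder sentinels assumed never to occur in Data_Value, so the first row per key is kept and rows equal to their predecessor are dropped):

-- Sketch: keep a row only if its Data_Value differs from the previous row's
-- value for the same key; the first row per key has no previous row, so the
-- COALESCE fallback makes it survive the comparison.
SELECT s.*
FROM SOURCE_DATA s
QUALIFY COALESCE(Data_Value, '<null>')
     <> COALESCE(MAX(COALESCE(Data_Value, '<null>'))
                 OVER (PARTITION BY KEY1, KEY2, KEY3
                       ORDER BY Transaction_dttm
                       ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
                ,'<no prev>');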
 

Raja K Thaw
My wiki: http://en.wikipedia.org/wiki/User:Kt_raj1
Street Children suffer not by their fault. We can help them if we want.

Boopathi15 11 posts Joined 12/13
07 Oct 2014

I can't use that, because the same kind of logic applies in other tables where I have a lot of data-value columns (around 20), so I would need 20 analytical functions, which makes things much worse.

 

 

Let me explain my case with a simple example.

 

Source Data

 

Key1     Key2        Data_value1             Data_value2    Transaction_Dttm
1        a           x                       y              2014-01-01 01:00:00
1        a           x                       b              2014-01-01 02:00:00
1        a           x                       b              2014-01-01 03:00:00
1        a           x                       y              2014-01-01 04:00:00
2        a           z                       z              2014-01-01 05:00:00
2        a           z                       z              2014-01-01 06:00:00

 

Expected Output

 

Key1     Key2        Data_value1           Data_value2  Valid_From_Dttm          Valid_To
1        a           x                     y            2014-01-01 01:00:00      2014-01-01 01:59:59
1        a           x                     b            2014-01-01 02:00:00      2014-01-01 03:59:59
1        a           x                     y            2014-01-01 04:00:00      9999-12-31 23:59:59
2        a           z                     z            2014-01-01 05:00:00      9999-12-31 23:59:59

 

How can I transform the data into this efficiently?

Raja_KT 1246 posts Joined 07/09
07 Oct 2014

What is the logic by which you arrive at Valid_To in the output for the 1st, 2nd, 3rd, and 4th rows?

Raja K Thaw
My wiki: http://en.wikipedia.org/wiki/User:Kt_raj1
Street Children suffer not by their fault. We can help them if we want.

dnoeth 4628 posts Joined 11/04
08 Oct 2014

Can you show your current process?
And what's your TD release?

Dieter

Boopathi15 11 posts Joined 12/13
15 Oct 2014

Hi Raja
I compare the current row with the next row and drop the next row if the data values are the same.
If the data values are different, then the Valid_To_dttm of the current row is the next row's Transaction_dttm minus 1 second. This is to avoid duplication of data.
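For the first example row that means 2014-01-01 02:00:00 minus 1 second = 2014-01-01 01:59:59. Expressed as a rough sketch (Valid_To rule only, the duplicate-drop step left out; the high-date literal is just what I want for still-open rows):

-- Sketch of the Valid_To rule: the next row's Transaction_dttm minus 1 second,
-- or the high date when there is no next row for the key.
SELECT
   Key1
  ,Key2
  ,Data_value1
  ,Data_value2
  ,Transaction_dttm AS Valid_From_Dttm
  ,COALESCE(MAX(Transaction_dttm)
            OVER (PARTITION BY Key1, Key2
                  ORDER BY Transaction_dttm
                  ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING)
            - INTERVAL '1' SECOND
           ,TIMESTAMP '9999-12-31 23:59:59') AS Valid_To_Dttm
FROM table_01;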
 

Boopathi15 11 posts Joined 12/13
15 Oct 2014

Hi Dieter
We are on Teradata 13.10.
Currently I create a row number for each record within the same primary key and join the table back to itself on primary key + row number, something like this.
 
Consider this is the source data : Table_01

Key1     Key2        Data_value1             Data_value2    Transaction_Dttm
1        a           x                       y              2014-01-01 01:00:00
1        a           x                       b              2014-01-01 02:00:00
1        a           x                       b              2014-01-01 03:00:00
1        a           x                       y              2014-01-01 04:00:00
2        a           z                       z              2014-01-01 05:00:00
2        a           z                       z              2014-01-01 06:00:00

 

 

Step 2 : Adding row number : table_02

 

INSERT INTO table_02
SELECT
   a.*
  ,ROW_NUMBER() OVER (PARTITION BY key1, key2 ORDER BY Transaction_dttm ASC) AS rownumber
FROM table_01 a;

 

Key1     Key2        Data_value1             Data_value2    Transaction_Dttm Rownumber
1        a           x                       y              2014-01-01 01:00:00 1
1        a           x                       b              2014-01-01 02:00:00 2
1        a           x                       b              2014-01-01 03:00:00 3
1        a           x                       y              2014-01-01 04:00:00 4
2        a           z                       z              2014-01-01 05:00:00 1
2        a           z                       z              2014-01-01 06:00:00 2

 

Step 03: 

INSERT INTO table_03
SELECT
   a.key1
  ,a.key2
  ,a.data_value1
  ,a.data_value2
  ,a.transaction_dttm AS new_valid_from_dttm
  ,COALESCE(MAX(a.transaction_dttm)
            OVER (PARTITION BY a.key1, a.key2
                  ORDER BY a.transaction_dttm
                  ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING)
           ,TIMESTAMP '9999-12-31 23:59:59') AS new_valid_to_dttm
  ,CASE
      WHEN COALESCE(a.data_value1, 'ZZ') = COALESCE(b.data_value1, 'ZZ')
       AND COALESCE(a.data_value2, 'ZZ') = COALESCE(b.data_value2, 'ZZ')
       AND a.rownumber > 1 THEN 0
      ELSE 1
   END AS data_chk
FROM table_02 a
JOIN table_02 b
  ON a.key1 = b.key1
 AND a.key2 = b.key2
 AND a.rownumber = b.rownumber + 1
WHERE data_chk = ;


dnoeth 4628 posts Joined 11/04
16 Oct 2014

Are you sure that this is returning the expected result?
Seems like you want to combine multiple consecutive rows with the same values into one, a slowly changing dimension? This is why there are Temporal Tables :)
You might try to utilize the TD_NORMALIZE_MEET function which works on Periods, so you must create a Period first:

INSERT INTO table_02
WITH cte AS
 (
   SELECT  
      Key1
     ,Key2
     ,Data_value1
     ,Data_value2
     ,PERIOD(Transaction_Dttm
            ,COALESCE(MAX(Transaction_Dttm)
                      OVER (PARTITION BY Key1, Key2
                            ORDER BY Transaction_Dttm
                            ROWS BETWEEN 1 FOLLOWING 
                                     AND 1 FOLLOWING)
                     ,TIMESTAMP'9999-12-31 23:59:59.999999')
            ) AS pd
   FROM table_01
 )
SELECT
   Key1
  ,Key2
  ,Data_value1
  ,Data_value2
  ,BEGIN(pd2)
  ,LAST(pd2)  
FROM TABLE (TD_NORMALIZE_MEET
             (NEW VARIANT_TYPE(cte.Key1
                              ,cte.Key2
                              ,cte.Data_value1
                              ,cte.Data_value2)
             ,cte.pd)
     RETURNS (Key1 INT
             ,Key2 CHAR
             ,Data_value1 CHAR
             ,Data_value2 CHAR
             ,pd2 PERIOD(TIMESTAMP(6))
             ,Nrm_Count INTEGER)
     HASH BY Key1, Key2, Data_value1, Data_value2 
     LOCAL ORDER BY  Key1, Key2, Data_value1, Data_value2, pd
 ) dt
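BEGIN(pd2) is then your Valid_From; LAST(pd2) returns the period's ending bound minus one granule (one microsecond for TIMESTAMP(6)), i.e. the last moment before the next version starts, which should match the "minus 1 second" idea in your current process.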

 

Dieter
