nevermore 8 posts Joined 03/13
25 Jan 2014
Generate surrogate key with a huge table

Hello everybody,
I have a table with over 20 billion records, and it gets bigger every day.
I have a problem generating the surrogate key (skey), because it takes too much time.
I use ROW_NUMBER() + SEL MAX_ID logic, but even the SEL MAX_ID statement takes over 20 minutes.

For such a big table, should I use some other logic to create the skey?
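For reference, the pattern I am using now is roughly the following (target_table, staging_table, natural_key and col1 are just placeholders for my real names):

-- current approach: offset ROW_NUMBER() over staging by the target's MAX(sk);
-- the MAX scan over the 20-billion-row target alone runs for over 20 minutes
INSERT INTO target_table (sk, natural_key, col1)
SELECT mx.max_sk + ROW_NUMBER() OVER (ORDER BY s.natural_key)
     , s.natural_key
     , s.col1
FROM   staging_table s
     , (SELECT COALESCE(MAX(sk), 0) AS max_sk
        FROM   target_table) mx;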
 
Thanks.

Raja_KT 1246 posts Joined 07/09
25 Jan 2014

I used the same method when loading from staging to target. I think you use staging too, do you? Please let me know.

I feel that, at any cost, you have to do it one way or the other. That's my thought. For now, all I can think of is
running the job during a low-traffic window (maybe at night).

 

Cheers,

Raja K Thaw
My wiki: http://en.wikipedia.org/wiki/User:Kt_raj1
Street Children suffer not by their fault. We can help them if we want.

nevermore 8 posts Joined 03/13
25 Jan 2014

First I create the skey in staging, then I load the data to the target using that skey.
What do you mean by "one way or the other"? Can you tell me more, please?
Thanks.

Raja_KT 1246 posts Joined 07/09
26 Jan 2014

 

So yes, it is the same concept we used. I think it is better to do it this way rather than using a sequence or an identity number. There may be smarter ways of doing it, if someone can share.

Cheers,

Raja K Thaw
My wiki: http://en.wikipedia.org/wiki/User:Kt_raj1
Street Children suffer not by their fault. We can help them if we want.

Santanu84 122 posts Joined 04/13
27 Jan 2014

Hi
To my knowledge, there are 3 ways to generate an skey within Teradata:
1. Identity column, 2. Hashing algorithm, and 3. Analytical function
The 3rd option is the most effective. We can use either ROW_NUMBER() or CSUM() to generate the skey, but as the table grows, so does the processing time.
Hopefully a new TD version will provide a sequence generator option like Oracle's.
However, you can also try generating the skey through an ETL/ELT tool.
I agree with Raja; if the experts here can share their experience of a smarter way, that would be great.
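For illustration, options 1 and 3 look roughly like this (table and column names are only examples):

-- Option 1: identity column, Teradata assigns the number at insert time
-- (note: identity values are unique but not guaranteed to be contiguous)
CREATE TABLE edw.dim_customer
( cust_sk  BIGINT GENERATED ALWAYS AS IDENTITY
           (START WITH 1 INCREMENT BY 1 NO CYCLE)
, cust_id  VARCHAR(50) NOT NULL
) PRIMARY INDEX (cust_sk);

-- Option 3: analytical function, offset by the current maximum of the target
SELECT (SELECT COALESCE(MAX(cust_sk), 0) FROM edw.dim_customer)
       + ROW_NUMBER() OVER (ORDER BY s.cust_id) AS cust_sk
     , s.cust_id
FROM   stg.customer s;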
 
Thanking You
Santanu

Raja_KT 1246 posts Joined 07/09
28 Jan 2014

Hi,
The third option is discussed in detail, with an explanation from an expert, here:
http://forums.teradata.com/forum/database/best-way-to-generate-the-sequential-numbers-csum-or-identity-columns
A sequence generator option like Oracle's has its loopholes too: if you use currval, nextval, etc., a rerun can mess up your data and force you to go back and correct it.
Yes, ETL tools can do it as well, e.g. Ab Initio (next_in_sequence etc.), Informatica (Sequence Generator transformation), SAP DS (key_generation), etc.
Cheers,

Raja K Thaw
My wiki: http://en.wikipedia.org/wiki/User:Kt_raj1
Street Children suffer not by their fault. We can help them if we want.

padhia 35 posts Joined 06/10
28 Jan 2014

How about storing the max value in a separate table instead of calculating it every time?
For example, let a table sk_seed have two columns (tablename varchar(128), next_sk bigint) with tablename as the PI; then the pseudo code for generating surrogate keys would be something like:
 

bt;  -- begin transaction

-- lock the seed row for write in case of concurrent usage
update sk_seed set next_sk = next_sk+0 where tablename = 'staging_table'; 

-- generate and use surrogate keys
select (select next_sk from sk_seed where tablename = 'staging_table') 
      + row_number() over(order by natural_key) as surrogate_key
from staging_table;

-- update next_sk for the next iteration
update sk_seed set next_sk = next_sk + (select count(*) from staging_table) where tablename = 'staging_table'; 

et;  -- end transaction

 
One downside of this approach: if the surrogate-key generation logic isn't centralized to use the above, the sk_seed table may get out of sync and you'll end up with duplicate surrogate keys.
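The seed table itself is a one-time setup, roughly like this (the target table and its key column are placeholders here):

-- one-time setup for the seed table used in the pseudo code above
CREATE TABLE sk_seed
( tablename VARCHAR(128) NOT NULL
, next_sk   BIGINT       NOT NULL
) UNIQUE PRIMARY INDEX (tablename);

-- seed it with the current maximum key of the target table
INSERT INTO sk_seed (tablename, next_sk)
SELECT 'staging_table', COALESCE(MAX(surrogate_key), 0)
FROM target_table;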
 

Adeel Chaudhry 773 posts Joined 04/08
31 Jan 2014

Is there any compulsion to use ROW_NUMBER?

From my experience, CSUM and ROW_NUMBER can slow the response time down considerably.

For smaller tables they work fine, but for larger tables I prefer the following logic [and am getting much better results]:
 

SELECT
     SUM(1) OVER (ROWS UNBOUNDED PRECEDING) + AliasMAX.MAX_VALUE AS SK
    ,Col1
    ,Col2
FROM LRD.Table1 LRD
LEFT OUTER JOIN EDW.Table1 EDW
    ON EDW.ID = LRD.ID
-- cross join to a one-row derived table holding the current maximum key
,(
    SELECT COALESCE(MAX(ID), 0) AS MAX_VALUE
    FROM EDW.Table1
 ) AliasMAX

 

HTH!

-- If you are stuck at something .... consider it an opportunity to think anew.

dnoeth 4628 posts Joined 11/04
10 Feb 2014

Hi Adeel,
did you check DBQL for the resources used?
SUM(1) or COUNT(*) OVER (ROWS UNBOUNDED PRECEDING) without a PARTITION BY should be worse than ROW_NUMBER:
there will be an implicit ORDER BY over all columns, while ROW_NUMBER can sort by fewer than all columns, so the SUM should consume more spool and more CPU.

If there's a PARTITION BY, then the SUM actually doesn't sort, making it faster than ROW_NUMBER.
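For illustration (column names are made up):

-- no PARTITION BY: implicit ORDER BY over all columns, big sort in spool
SELECT SUM(1) OVER (ROWS UNBOUNDED PRECEDING) AS sk, natural_key
FROM staging_table;

-- ROW_NUMBER only sorts by the columns in its ORDER BY
SELECT ROW_NUMBER() OVER (ORDER BY natural_key) AS sk, natural_key
FROM staging_table;

-- with a PARTITION BY the SUM avoids the sort, but numbering restarts per partition
SELECT SUM(1) OVER (PARTITION BY region_cd ROWS UNBOUNDED PRECEDING) AS sk, natural_key
FROM staging_table;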

Dieter

Adeel Chaudhry 773 posts Joined 04/08
18 Feb 2014

Indeed, Dieter .... I wrote the SQL on the fly and missed the PARTITION BY in there. Thanks! :)

-- If you are stuck at something .... consider it an opportunity to think anew.

hs255021 2 posts Joined 06/15
20 Jun 2015

Hi there,
Can somebody tell me how to do this?
I have a large table t1 with a billion records and columns id (c1), c2, c3, c4, plus 10 more columns. The values of columns c2, c3, and c4 are incorrect in this table for ids 500 to 1000.
Another table t2 has the same definition as t1 but with only the columns id (c1), c2, c3, and c4. The common columns are identical in both tables, and t2 has the correct data for c2, c3, and c4.
Now I need to update table t1 with the correct data from t2 without using an "update" command, as updating the large table t1 is costly. But I can use a staging table.
Please let me know. Thanks.

TDThrottle 51 posts Joined 11/11
20 Jun 2015

Since the primary index (id, c1) is the same for t1 and t2, you may not face a hard performance issue with a direct update. Otherwise, using a staging table to delete and re-insert ids 500 to 1000 from/to t1 will have the same overhead.
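A sketch of that direct update (untested, using the column names from your post):

-- direct joined update; rows are co-located since t1 and t2 share PI c1
UPDATE a
FROM t1 a, t2 b
SET c2 = b.c2
  , c3 = b.c3
  , c4 = b.c4
WHERE a.c1 = b.c1
  AND a.c1 BETWEEN 500 AND 1000;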
Thanks!!

hs255021 2 posts Joined 06/15
23 Jun 2015

Thank you. Can you please give more detail on how the staging table would be used here?

TDThrottle 51 posts Joined 11/11
24 Jun 2015

Option 2: use an ETL approach
1. Create a copy of t1 for ids 500 to 1000
2. Delete ids 500 to 1000 from t1
3. Update <t1-copy> using t2
4. Insert <t1-copy> back into t1
You can execute both options, (1) direct t1 update and (2) the ETL approach, and measure the overhead. As said earlier, since the PI is the same for t1 and t2, expect performance to remain good with option (1).
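A rough SQL sketch of option (2), with <t1-copy> named t1_copy here (untested):

-- 1. copy the affected rows into a work table
CREATE TABLE t1_copy AS
( SELECT * FROM t1 WHERE c1 BETWEEN 500 AND 1000 ) WITH DATA
PRIMARY INDEX (c1);

-- 2. delete them from t1
DELETE FROM t1 WHERE c1 BETWEEN 500 AND 1000;

-- 3. correct the copy using t2
UPDATE a
FROM t1_copy a, t2 b
SET c2 = b.c2
  , c3 = b.c3
  , c4 = b.c4
WHERE a.c1 = b.c1;

-- 4. insert the corrected rows back into t1
INSERT INTO t1 SELECT * FROM t1_copy;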
 
