09 Feb 2009
Why should we use a SET table?

A SET table forces Teradata to check for duplicate rows every time a row is inserted or updated. This can cause a lot of overhead on such operations. Why can't we use MULTISET tables for all tables that we load from other tables in the DWH, and use GROUP BY in the insert to remove duplicates?
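For example, something like this (just a sketch; target_ms and staging_tbl are placeholder names):

INSERT INTO target_ms (c1, c2, c3)
SELECT c1, c2, c3
FROM staging_tbl
GROUP BY 1, 2, 3;  /* GROUP BY on all columns removes full-row duplicates before the insert */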

Jim Chapman 449 posts Joined 09/04
10 Feb 2009

There's no reason to use a SET (as opposed to MULTISET) table if you don't need the automatic duplicate checking. By the way, if a table has a uniqueness constraint defined on some set of columns (i.e., a primary key or unique index), then there is absolutely no difference between set and multiset, as only the uniqueness constraint will be enforced in either case.

Fred 1096 posts Joined 08/04
10 Feb 2009

Actually, there IS a difference between SET and MULTISET even if you have a UNIQUE index:

CREATE TABLE T1 (c1 INTEGER NOT NULL, c2 INTEGER) PRIMARY INDEX (c1);
INSERT T1 VALUES (1,1);
INSERT T1 VALUES (1,2);
INSERT T1 VALUES (2,1);
CREATE SET TABLE T2_S (c1 INTEGER NOT NULL) UNIQUE PRIMARY INDEX (c1);
CREATE MULTISET TABLE T2_MS (c1 INTEGER NOT NULL) UNIQUE PRIMARY INDEX (c1);
INSERT T2_S SELECT c1 FROM T1;/* succeeds - quietly eliminates the duplicate and inserts 2 rows */
INSERT T2_MS SELECT c1 FROM T1;/* fails with duplicate key error */

11 Feb 2009

Hi Jim Chapman / Fred Pluebell, I am concerned about the time taken to load SET vs. MULTISET tables. In my application there are many SET tables created for storing temporary results. Loading these SET tables is taking more time, so we are planning to change them to MULTISET and add a GROUP BY on all columns to remove duplicate rows; the table index is a Primary Index. (Is this the correct way?)

Fred 1096 posts Joined 08/04
11 Feb 2009

First, I'd suggest you verify that you have chosen a good Primary Index. If the number of rows with any given PI value is small, the SET table duplicate row checking overhead is small. There may be cases in which an extremely non-unique PI is a reasonable choice; in those cases changing to MULTISET is certainly worth considering. Just be sure you aren't covering up the real problem.
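A quick way to check that distribution (a sketch; my_table and pi_col are placeholder names):

SELECT pi_col
     , COUNT(*) AS rows_per_value
FROM my_table
GROUP BY 1
ORDER BY 2 DESC;  /* large counts at the top are where SET duplicate row checking hurts */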

jconcepcion1 4 posts Joined 03/12
10 Jul 2012

(Somewhat of a late comment relative to the prior posts, but the internet is timeless and it will help any googlers).

Can someone quantify what they mean when they say that a SET table has overhead? I like the fact that a SET table discards duplicates for me. I want to know how it scales. If I want to store, say, person id and sku pairs for billions of rows, I simply insert into my SET table knowing that it will discard prior pairs. That's better than coding a proper NOT EXISTS or other technique, as I would have to with a MULTISET table. How far can I push this before the overhead is no longer worth it?
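(For reference, the MULTISET-style alternative alluded to above would look something like this sketch, with placeholder names person_sku and stg:)

INSERT INTO person_sku (person_id, sku)
SELECT s.person_id, s.sku
FROM stg s
WHERE NOT EXISTS
  ( SELECT 1
    FROM person_sku t
    WHERE t.person_id = s.person_id
      AND t.sku = s.sku )
GROUP BY 1, 2;  /* the GROUP BY also removes duplicate pairs within the staging data itself */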

 

dnoeth 4628 posts Joined 11/04
11 Jul 2012

To insert a new row the system has to do a duplicate row check for all existing rows with the same RowHash value, i.e. all rows with the same PI value or another PI value which happens to hash the same.

And of course it's cumulative: to insert N rows with the same RowHash you'll need (N*(N-1)) / 2 checks:

10 rows -> 45 dup row checks
100 -> 4950
1000 -> 499500

The rule of thumb is:
If the number of rows per value is low (maybe up to a few hundred) and all rows fit on a single datablock it might be acceptable.
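To estimate the total for an existing table you could combine that formula with the rows-per-rowhash counts, e.g. with a sketch like this (my_table and pi_col are placeholder names; HASHROW returns the row hash the duplicate row check is based on):

SELECT SUM(rows_per_hash * (rows_per_hash - 1) / 2) AS est_dup_row_checks
FROM ( SELECT HASHROW(pi_col) AS row_hash
            , COUNT(*) AS rows_per_hash
       FROM my_table
       GROUP BY 1 ) t;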

Dieter

jconcepcion1 4 posts Joined 03/12
13 Jul 2012

Dieter,

How does each instance of a dup row check translate into time? The formula suggests far more than a linear increase, but I'm not seeing that. I have a x00 million row SET table into which I insert x0 million rows daily, with net new rows around 5%, and I'm not seeing the time degradation the formula would indicate.

What would you advise as a more efficient technique where I simply want to sweep new pairs into my SET pair table? As it is, I don't have to worry about ETL change-data coding.

Appreciate your feedback.

Julius

 

 

dnoeth 4628 posts Joined 11/04
15 Jul 2012

Hi Julius,

What's the average number of rows per value for this table?

The formula is only for comparing the overhead at different rows-per-PI counts: when the count is doubled, the number of dup checks needed to insert all those rows with the same PI is quadrupled.

When the number of rows per PI increases because you insert new rows with an existing PI, you should see a linear increase of CPU usage in DBQL. E.g. existing number of rows per PI value: 50, inserting 1 row per day per PI.
After 50 days the number of rows per PI has doubled; to insert a new row there are now 99 dup checks vs. 49 on the first day, resulting in doubled CPU time per row.
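If you want to verify that trend, a DBQL sketch along these lines should show it (assuming query logging is enabled; my_set_table and the LIKE pattern are placeholders for identifying the daily load):

SELECT CAST(StartTime AS DATE) AS run_date
     , SUM(AMPCPUTime) AS cpu_seconds
FROM DBC.DBQLogTbl
WHERE QueryText LIKE 'INSERT INTO my_set_table%'
GROUP BY 1
ORDER BY 1;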

Dieter

iamsughesh 2 posts Joined 09/11
19 Jul 2012

SET tables will insert rows quickly when you start inserting, but become much slower as the table's record count reaches millions.

This is because a SET table won't allow an entire row to be duplicated, so as you insert new rows the TD system has to check whether each row is already present.

If the row is already present, it won't be inserted, and no error message is given either.

As this is an automatic process, it requires a lot of resources and processor time.

Initially it works fast, and as the record count reaches millions it becomes slow.

The likely cause is the 'duplicate row check'. This process slows down as you add more rows to the target table. It is a problem when you have too many rows per PI data value (well, technically per PI rowhash) AND the table is SET AND there are no unique indexes on the target table.

- As the process starts, there are relatively few rows for any given PI rowhash value.
- Duplicate row checking happens, but with only a few rows to check it doesn't appear to slow you down.
- As you add more data to the target table, the number of rows with the same PI rowhash grows, and you can now 'see' the processing slow down.

If the duplicate row check looks like the culprit, then a couple of things come to mind:
- add a unique index
- change the table to MULTISET (providing, of course, that your data model/processing can handle this; see the sketch at the end of this post)

If the table has a unique index (either UPI or USI) then you will ** not ** have a problem with 'duplicate row checks'.

The only time you have problems with duplicate row checks is when:
- the table is SET, AND
- the table has NO unique indexes, AND
- you have lots of rows with the same PI rowhash value.

If any of the above conditions are NOT met then you will not have a problem with duplicate row checks during insert processing.
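A minimal sketch of the MULTISET option (placeholder names; the table kind can't be changed in place with ALTER TABLE, so the table is recreated and reloaded):

CREATE MULTISET TABLE my_table_ms AS my_table WITH NO DATA;
INSERT INTO my_table_ms SELECT * FROM my_table;
/* verify the copy, then drop the old table and rename the new one */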

Thanks,

Sughesh I S


dnoeth 4628 posts Joined 11/04
19 Jul 2012

Hi Sughesh,

why do you simply copy answers from a recent thread at www.teradataforum.com?

The first five sentences were written by you, but the rest is from Dave Wellmann.

And there was another post:

SET/NUPI plus USI still does duplicate row checks for insert/selects,
only single row inserts benefit from the USI.

You can easily see that when you insert/select the same rows a second time:
http://developer.teradata.com/node/6989#comment-14416
 

Dieter

xwang2 1 post Joined 10/10
03 Oct 2012

A MULTISET table produces a different explain plan from a SET table when a few tables are joined together, and the results return faster. My guess is that Teradata chooses a different explain plan when it sees a MULTISET table, but I do not understand why. Has anyone else had a similar experience?

dnoeth 4628 posts Joined 11/04
03 Oct 2012

When statistics exist on the join columns there should be no difference in plan, and when the plan doesn't change the speed will not change either.
But when stats are missing (on the 'one' side of a one-to-many join), the optimizer assumes more than one row will be joined and overestimates the number of rows returned by the join step. This is a built-in factor, which seems to vary based on some unknown conditions; let's assume 1.1:
Base table has 1,000,000 rows and all the joins are to the PK of the 2nd table:
1st join: estimated 1,100,000, actual 1,000,000
2nd join: estimated 1,210,000, actual 1,000,000
3rd join: estimated 1,331,000, actual 1,000,000
etc.
If the factor is different for MULTISET tables this might explain different plans, but I never tested it.
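For example (a sketch with placeholder names):

COLLECT STATISTICS ON fact_tbl COLUMN (join_col);
COLLECT STATISTICS ON dim_tbl COLUMN (pk_col);

With stats on both sides of the join the optimizer no longer applies that factor, so SET vs. MULTISET should not change the plan.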
Dieter

Santhosh.SaSo 3 posts Joined 10/12
05 Oct 2012

Hi friends..!!
I am new to a Teradata project.
I want to know what types of errors will come up when data is loading, and how to overcome those errors.
And please give me some tips; I want to become the best in Teradata Database Development. Please help me.
Can anyone please explain?
 
Thank you.. all of you..!! 

Santhosh.SaSo 3 posts Joined 10/12
05 Oct 2012

Hi.. Mr. Dnoeth..
Your way of explaining is very, very good and easy to understand... I am new to a Teradata project..
Please help me.. Can you send any dummy project to my mail ID.. Please..
Santhosh.saso@yahoo.in.. this is my mail ID..
 

AdamL 5 posts Joined 01/13
14 Jun 2013

In reference to the third comment posted by Fred,
 

I created a table T2_S1 similar to the original source table T1.

CREATE volatile SET TABLE T2_S1 (c1 INTEGER NOT NULL,c2 INTEGER) UNIQUE PRIMARY INDEX (c1) on commit preserve rows;

Now, I tried an INSERT ... SELECT into this SET table, but it gave an error.

 

INSERT T2_S1 SELECT c1,c2 FROM T1;/* succeeds - quietly eliminates the duplicate and inserts 2 rows */

 

Here it is failing, *** Failure 2801 Duplicate unique prime key error in iTunes_User.T2_MS.

                Statement# 1, Info =0 

 *** Total elapsed time was 1 second.

 

It seems the INSERT ... SELECT into a SET table, with duplicates being discarded, is not working here.

 

Can you please explain the behavior here? Is it that when we select all the columns, i.e. an entire row from the source table, the duplicate row check is done against the target table?

 

Regards,

Adam

 

Fred 1096 posts Joined 08/04
15 Jun 2013

Not exactly. The duplicate row check is being done in both cases, but in your example the entire rows (1,1) and (1,2) are not duplicates, so neither is eliminated. Therefore the uniqueness violation on the PI is reported.
In my example, the rows (1) and (1) were identical, so one was quietly discarded and there was no uniqueness violation.

AdamL 5 posts Joined 01/13
15 Jun 2013

Thanks Fred. I got it now :)

vasudev 24 posts Joined 12/12
17 Jul 2013

Hi,
I am using TD12 and have three questions regarding SET tables. 1. Does a NUPI slow down inserts into SET tables? 2. Is there any limitation on the number of indexes for SET tables loaded by MLOAD? 3. Does FLOAD perform the duplicate row check on SET tables?
Please advise.

Fred 1096 posts Joined 08/04
17 Jul 2013
  1. Simply having a NUPI does not slow down inserts, and having a small number of duplicates won't be a problem. But when you insert the Nth row with the same NUPI, Teradata must check against all N-1 existing rows to verify the new row is not a complete duplicate. That overhead grows very quickly as N increases.
  2. No MLOAD-specific limit on number of NUSIs.
  3. FastLoad removes duplicate rows in Phase 2 (even for MULTISET tables). The process is slightly different from SQL INSERT / UPDATE duplicate row check.
vasudev 24 posts Joined 12/12
18 Jul 2013

Hi Fred,
Thank you very much for your response. Does FLOAD perform the duplicate row check for SET tables as well? And are there any limitations to loading a SET table?
Please advise
 

suhailmemon84 47 posts Joined 09/10
18 Dec 2013

quick question:
Will this cause a nupi duplicate row check operation too?
Insert into <A_4_COLUMN_set_table_with only NUPI ON IT>
SELECT COL1, COL2,COL3,COL4 FROM
<MULTISET STAGE TABLE>
GROUP BY 1,2,3,4
Regards,
Suhail

M.Saeed Khurram 544 posts Joined 09/12
18 Dec 2013

Hi Suhail,
Your question is a little bit ambiguous, but if you have a NUPI defined on a SET table, the duplicate row check will be performed.
You can avoid the duplicate row check by either defining the table as MULTISET or making the column a UPI. Another way to implement uniqueness is to define a USI.
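Rough sketches of those options (placeholder names; note Dieter's earlier comment in this thread that a USI only avoids the dup row check for single-row inserts, not for INSERT ... SELECT):

/* 1. MULTISET: no duplicate row check at all */
CREATE MULTISET TABLE t_ms (c1 INTEGER, c2 INTEGER) PRIMARY INDEX (c1);

/* 2. SET with a UPI: uniqueness is enforced on the PI instead */
CREATE SET TABLE t_upi (c1 INTEGER, c2 INTEGER) UNIQUE PRIMARY INDEX (c1);

/* 3. SET with a NUPI plus a USI on another column */
CREATE SET TABLE t_usi (c1 INTEGER, c2 INTEGER) PRIMARY INDEX (c1);
CREATE UNIQUE INDEX (c2) ON t_usi;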
 

Khurram

suhailmemon84 47 posts Joined 09/10
18 Dec 2013

I guess the ambiguity was that I failed to mention that the target table is empty (I apologize).
I wanted to know whether the NUPI dup row check will be done if we're inserting data into an empty SET table from a MULTISET stage table, despite a GROUP BY on all columns in the stage table (thereby removing duplicates).
-Suhail
 

M.Saeed Khurram 544 posts Joined 09/12
18 Dec 2013

Yes, as long as the table is SET and the index is non-unique, the duplicate row check will be performed. The table is empty initially, but as rows are inserted it will no longer be empty, and the SET table will compare rows byte by byte to avoid duplicates.
 

Khurram

suhailmemon84 47 posts Joined 09/10
18 Dec 2013

I apologize for being persistent on this.
So you do confirm that the very first INSERT ... SELECT (where the insert is happening into an empty table with 0 records) will also be subject to the NUPI dup row check?
I can understand why subsequent inserts into the same table will be subject to the NUPI dup row check, but why the first one?
-Suhail

cwindland 11 posts Joined 11/04
29 Apr 2015

Dieter:
I've always been under the impression that a good way to improve the insert efficiency (via SQL) of a SET table is the use of a PK/USI.  Prior to the insert, the USI is used to check for duplicates through the hash values of the index columns.  This saves the SET table's second-row-instance (duplicate row) check.  I was recently told that the base row is inserted first and then the index row is inserted; the index insert may or may not pass the duplicate test and will fail accordingly.  Consequently, a USI won't save time.  This does not sound reasonable to me.  Wouldn't it make more sense to check the hash value of the USI first for hash duplicates and save all this I/O?
 
Can you clear this up for me?
 
Thank you in advance
 
 

jrgchip 4 posts Joined 05/15
04 May 2015

@Fred
I used your 2/10/2009 example SQL exactly in a TD v14 env and both INSERT statements failed on the same 2801 error.
Might TD have changed the behavior such that SET and MULTISET now function the same in your scenario?

Fred 1096 posts Joined 08/04
06 May 2015

No, the behavior of SET / MULTISET is unchanged. The INSERT/SELECT into the SET table succeeds and inserts two rows.

CREATE TABLE T1 (c1 INTEGER NOT NULL, c2 INTEGER) PRIMARY INDEX (c1);
INSERT T1 VALUES (1,1);
INSERT T1 VALUES (1,2);
INSERT T1 VALUES (2,1);
CREATE SET TABLE T2_S (c1 INTEGER NOT NULL) UNIQUE PRIMARY INDEX (c1);
CREATE MULTISET TABLE T2_MS (c1 INTEGER NOT NULL) UNIQUE PRIMARY INDEX (c1);
INSERT T2_S SELECT c1 FROM T1;/* succeeds - quietly eliminates the duplicate and inserts 2 rows */
INSERT T2_MS SELECT c1 FROM T1;/* fails with duplicate key error */

 
