The following information covers a handful of the basics of the Teradata architecture, to help new Teradata users better understand what considerations should go into creating a table.

Teradata Architecture

Teradata uses Massively Parallel Processing (MPP) to provide linear scalability of the system by distributing the data across a number of processing units (AMPs). Each record in a table is placed on an AMP. The more evenly the data is distributed across the AMPs for all tables, the better the system performs, because each AMP does an equal amount of work to satisfy a query. When the data is unevenly distributed, the AMPs with the most records work harder, while the AMPs with the fewest records are underutilized.

Key point: Teradata performs the best when data is evenly distributed

Primary Index

The primary index (PI) distributes the records in a table across the AMPs, by hashing the columns that make up the PI to determine which records go to which AMP. If no PI is specified when a table is created, the first column of the table will be used as the PI.

When creating a table, care needs to be taken to choose a column or set of columns that evenly distribute the data across the AMPs. A PI that distributes data unevenly will at the very least impact the performance of the table, and depending on the size of the table, has the potential to negatively impact the entire system.
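One rough way to evaluate a candidate PI before committing to it is to use the hash functions Teradata exposes in SQL (HASHROW, HASHBUCKET, and HASHAMP) to see how many rows would land on each AMP. This is only a sketch; the table and column names below (sales, customer_id) are hypothetical:

```sql
-- Sketch: count how many rows of a hypothetical sales table would land
-- on each AMP if customer_id were chosen as the primary index.
SELECT HASHAMP(HASHBUCKET(HASHROW(customer_id))) AS amp_no,
       COUNT(*) AS row_count
FROM sales
GROUP BY 1
ORDER BY row_count DESC;
```

A large gap between the highest and lowest row counts suggests the candidate column would distribute the data poorly.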

Even distribution isn’t the only criterion to use when choosing a PI. Consideration should also be given to how the data will be queried. If the data can be distributed evenly using several different sets of columns, the choice should be based on how the data will be queried and which other tables it will be joined to. If two tables that are frequently joined have the same PI, joining them doesn’t require redistributing records to other AMPs to satisfy the query.
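As a minimal sketch of a PI-aligned join (all table and column names here are hypothetical), two tables that share customer_id as their PI can be joined on that column without redistribution, because matching rows already reside on the same AMP:

```sql
-- Both tables hash-distribute on customer_id, so the join below is AMP-local.
CREATE MULTISET TABLE customer
( customer_id   INTEGER,
  customer_name VARCHAR(100)
) PRIMARY INDEX (customer_id);

CREATE MULTISET TABLE sales
( sales_id    INTEGER,
  customer_id INTEGER,
  sales_amt   DECIMAL(12,2)
) PRIMARY INDEX (customer_id);

-- No redistribution step is needed to satisfy this join.
SELECT c.customer_name, SUM(s.sales_amt)
FROM customer c
JOIN sales s
  ON c.customer_id = s.customer_id
GROUP BY 1;
```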

A PI doesn’t have to be the same as the primary key (PK) of a table. The purpose of a PI is to evenly distribute the data, while the purpose of a PK is to identify unique records. The PI and PK can be the same, but it isn’t required.

Key point: Picking the right primary index is critical to ensuring good table and system performance

Skew Factor

A table that has perfectly distributed data has a skew factor of 0%. The higher the skew factor is, the more unevenly data in a table is distributed. As a general rule, tables with a skew factor higher than 50% should be evaluated to determine if a different primary index would distribute the data better, and thereby improve performance.

Tables with fewer records than the number of AMPs will have a skew factor higher than 0%, simply because the records cannot be spread across all of the AMPs. For such tables, choosing a different primary index may not improve the skew factor.
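A common way to approximate a table’s skew factor is from the per-AMP permanent space recorded in the DBC.TableSizeV view (DBC.TableSize on older releases). This is a sketch of the idea, not necessarily the exact formula your monitoring tools use:

```sql
-- Perfectly even distribution gives AVG(CurrentPerm) = MAX(CurrentPerm),
-- i.e. a skew factor of 0%.
SELECT DatabaseName,
       TableName,
       CAST(100 * (1 - AVG(CurrentPerm) / NULLIFZERO(MAX(CurrentPerm)))
            AS DECIMAL(5,2)) AS skew_factor_pct
FROM DBC.TableSizeV
GROUP BY DatabaseName, TableName
ORDER BY skew_factor_pct DESC;
```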

Key point: Skew factor indicates how evenly a table’s data is distributed

Table Creation Options

Along with choosing the PI of a table, another choice needs to be made when creating a table: SET or MULTISET, with SET being the default in Teradata semantics mode if neither is specified. A SET table prohibits records with identical values in every column (duplicate records) from existing in the table, while a MULTISET table allows them. When using FastLoad or the TPT Load Operator, if you attempt to insert duplicate records into a SET table, the duplicates are discarded without any notification that an attempt to insert duplicates took place. A SET table with no unique primary index incurs a performance overhead, because every record that is inserted must be evaluated to determine whether a duplicate already exists in the table. This overhead can be minimized by defining a unique index on the table (see the Database Design manual for Teradata Database for more information on minimizing duplicate row checks for tables without unique primary indexes).

Create table syntax examples:

CREATE SET TABLE … (results in a SET table being created)

CREATE MULTISET TABLE … (results in a MULTISET table being created)

CREATE TABLE … (results in a SET table in Teradata semantics mode and a MULTISET table in ANSI semantics mode)

Key point: SET or MULTISET in a table creation statement determines whether duplicate records can be stored in the table

Suggestions for further reading

Teradata provides extensive documentation at http://www.info.teradata.com/. Download the full documentation set for the release your site uses, and then start with the document titled “Introduction to Teradata Warehouse” (V2R6 and 12.0) or “Introduction to Teradata” (13.0 and 13.10).

Discussion
bwb 4 comments Joined 12/10
04 Jan 2011

One minor correction: CREATE TABLE will result in a SET table being created if the session is using Teradata semantics; CREATE TABLE will result in a MULTISET table being created if the session is using ANSI semantics.

The semantics mode (Teradata vs. ANSI) can be set globally via DBSControl (SessionMode GDO), or at the session level by the user with CLIv2 applications (DBCAREA Transaction Semantics field) or certain Teradata utilities (e.g., PP2 TRANSACT(ANSI|BTET) or BTEQ .SET SESSION TRANSACTION ANSI|TERADATA). Note: despite the use of "transaction" in the options, more than transaction semantics are affected.

jskarphol 6 comments Joined 07/10
05 Jan 2011

Thanks for providing the clarification on the effect of the Teradata and ANSI semantics mode on a CREATE TABLE statement when SET or MULTISET aren't specified. I've edited the article to add this clarification.

jana.teradata 2 comments Joined 02/10
06 Jan 2011

That was a good explanation....

Is there a way to know which semantics mode we are in? For example, if we connect from Queryman, what is the default, and if we connect from BTEQ, what is the default?

jskarphol 6 comments Joined 07/10
06 Jan 2011

You can tell which semantics mode you're in with SQL Assistant or BTEQ using the following information:
SQL Assistant (AKA Queryman) - run the HELP SESSION command and look for the column named Transaction Semantics to tell whether you're in Teradata or ANSI mode
BTEQ - log into BTEQ and look for the message that says 'Transaction semantics are' followed by BTET (Teradata) or ANSI

its_sashi 1 comment Joined 01/11
21 Jan 2011

Hi,

The explanation was very clear and crisp, thanks for the info.

breidy 1 comment Joined 03/11
08 Mar 2011

If a SET table is created with a UNIQUE INDEX and/or a PRIMARY KEY INDEX on a subset of the columns in the table, will the database still examine each additional column for duplication?

Table ( colA_SK
, colB_NK
, colC_NK
, colD
, colE
, ... colZ)
UNIQUE PRIMARY INDEX (colA_SK)
UNIQUE INDEX (colB_NK, colC_NK)

With those unique indexes no combination of colB_NK and colC_NK can be duplicated, so would the database continue to validate the uniqueness of the record with colD - colZ on a SET table? Or should you set it up as a MULTISET table with the unique indexes? (The ETL process will also be taking into account the NK and PK on the table for INSERTs vs UPDATEs on the table.)

bwb 4 comments Joined 12/10
08 Mar 2011

With a UNIQUE PRIMARY INDEX, there will be no duplicate row checking (there can't be any duplicate rows if the PIs are unique).

With a non-UNIQUE PRIMARY INDEX, but one or more UNIQUE INDEX, duplicate row checking is not necessary (since there can't be any duplicate rows if one or more SIs are unique). However, the ordering of the internal INSERT and UPDATE operations may result in a duplicate row check, and that certainly used to be the case; I have a query in to one of the DBS experts to find out if that is still the case.

Personally, I'm opposed to using MULTISET tables as a way to bypass the duplicate row check. It's all too easy to get into that habit, and if a particular MULTISET table has no UNIQUE indexes, the potential exists to end up with duplicate rows...and if the logic that accesses such a table assumes that there are no duplicates, you're potentially in big trouble.

jskarphol 6 comments Joined 07/10
09 Mar 2011

Thank you for the question and the answer provided in the comments on duplicate row checking for SET tables. I've updated the article to provide clarification on when duplicate row checking occurs on a SET table, including a reference to the Teradata Database Database Design manual for further information.

bwb 4 comments Joined 12/10
09 Mar 2011

You're welcome. I did get confirmation that if there is any UNIQUE INDEX, there will be no duplicate row checking.

There is one exception. If the error-logging option is used on an INSERT-SELECT or MERGE INTO targeting a SET table, duplicate row checking will be done if the PRIMARY INDEX is non-UNIQUE, regardless of the presence or absence of UNIQUE INDEX.

Also, I have a couple of other comments on the article:

1. In addition to performance, a significant skew factor results in more space being taken for table data on some AMPs, and less on others. A skew factor of 50% would mean that some AMPs have (roughly) twice as much data for the table than others; that could cause disk space to be exhausted on the high-skewing AMP(s) when there's lots of space free on the other AMPs. Except for tables with very few rows, I would guess that a skew factor of more than 10% (perhaps more than 5%) is abnormal and probably indicates a poor choice of PI. 50% would just be huge (my classic example of such bad skewing is someone making the PI of a personnel table the "sex" column, where all the rows would end up on two AMPs). In all my years with Teradata, I've never heard of a real-world (i.e., customer) skew more than 10% for any table with a significant number of rows.

2. With regard to duplicate rows, the statement "If you attempt to insert duplicate records into a SET table, the duplicates are discarded without any notification that an attempt to insert duplicates took place." isn't strictly true. It happens to be true if FastLoad (or the TPT Load Operator) is used. However, when using MultiLoad (or the corresponding TPT Update Operator) or TPump (or the corresponding TPT Stream Operator), you can choose whether or not duplicate rows are recorded. If you're using SQL directly (e.g., BTEQ), duplicate rows will always be reported (2802 error).

dnoeth 86 comments Joined 11/04
10 Mar 2011

The main difference between SET/MULTISET tables can be noticed when there's INSERT VALUES vs. INSERT SELECT:
INSERT VALUES never silently discards duplicate rows, but INSERT SELECT will if the table is SET.

I used this BTEQ-script in my trainings to show all the variations:

.SET ECHOREQ OFF
.SET TIMEMSG NONE

-- SET/NUPI
CREATE SET TABLE dropme(i INT, j INT) PRIMARY INDEX(i);
INSERT INTO dropme VALUES(1,1); -- *** One row added.
INSERT INTO dropme VALUES(1,1); -- *** Failure 2802 Duplicate row error
INSERT INTO dropme SELECT * FROM dropme; -- *** No rows added.
CREATE ERROR TABLE FOR dropme;
INSERT INTO dropme SELECT * FROM dropme LOGGING ALL ERRORS; -- *** No rows added.
SELECT etc_errorcode FROM et_dropme WHERE etc_errorcode <> 0; -- *** No rows found.
DROP ERROR TABLE FOR dropme;
DROP TABLE dropme;

-- SET/UPI
CREATE SET TABLE dropme(i INT, j INT) UNIQUE PRIMARY INDEX(i);
INSERT INTO dropme VALUES(1,1); -- *** One row added.
INSERT INTO dropme VALUES(1,1); -- *** Failure 2801 Duplicate unique prime key error
INSERT INTO dropme SELECT * FROM dropme; -- *** No rows added.
CREATE ERROR TABLE FOR dropme;
INSERT INTO dropme SELECT * FROM dropme LOGGING ALL ERRORS; -- *** No rows added.
SELECT etc_errorcode FROM et_dropme WHERE etc_errorcode <> 0; -- *** No rows found.
DROP ERROR TABLE FOR dropme;
DROP TABLE dropme;

-- SET/NUPI/USI
CREATE SET TABLE dropme(i INT, j INT ) PRIMARY INDEX(i), UNIQUE INDEX(j);
INSERT INTO dropme VALUES(1,1); -- *** One row added.
INSERT INTO dropme VALUES(1,1); -- *** Failure 2803 Secondary index uniqueness violation
INSERT INTO dropme SELECT * FROM dropme; -- *** No rows added.
CREATE ERROR TABLE FOR dropme;
INSERT INTO dropme SELECT * FROM dropme LOGGING ALL ERRORS; -- *** No rows added.
SELECT etc_errorcode FROM et_dropme WHERE etc_errorcode <> 0; -- *** No rows found.
DROP ERROR TABLE FOR dropme;
DROP TABLE dropme;

-- SET/UPI/USI
CREATE SET TABLE dropme(i INT, j INT) UNIQUE PRIMARY INDEX(i), UNIQUE INDEX(j);
INSERT INTO dropme VALUES(1,1); -- *** One row added.
INSERT INTO dropme VALUES(1,1); -- *** Failure 2801 Duplicate unique prime key error
INSERT INTO dropme SELECT * FROM dropme; -- *** No rows added.
CREATE ERROR TABLE FOR dropme;
INSERT INTO dropme SELECT * FROM dropme LOGGING ALL ERRORS; -- *** No rows added.
SELECT etc_errorcode FROM et_dropme WHERE etc_errorcode <> 0; -- *** No rows found.
DROP ERROR TABLE FOR dropme;
DROP TABLE dropme;

-- MULTISET/NUPI
CREATE MULTISET TABLE dropme(i INT, j INT) PRIMARY INDEX(i);
INSERT INTO dropme VALUES(1,1); -- *** One row added.
INSERT INTO dropme VALUES(1,1); -- *** One row added.
INSERT INTO dropme SELECT * FROM dropme; -- *** 2 rows added.
CREATE ERROR TABLE FOR dropme;
INSERT INTO dropme SELECT * FROM dropme LOGGING ALL ERRORS; -- *** 4 rows added.
SELECT etc_errorcode FROM et_dropme WHERE etc_errorcode <> 0; -- *** No rows found.
DROP ERROR TABLE FOR dropme;
DROP TABLE dropme;

-- MULTISET/UPI
CREATE MULTISET TABLE dropme(i INT, j INT) UNIQUE PRIMARY INDEX(i);
INSERT INTO dropme VALUES(1,1); -- *** One row added.
INSERT INTO dropme VALUES(1,1); -- *** Failure 2801 Duplicate unique prime key error
INSERT INTO dropme SELECT * FROM dropme; -- *** Failure 2801 Duplicate unique prime key error
CREATE ERROR TABLE FOR dropme;
INSERT INTO dropme SELECT * FROM dropme LOGGING ALL ERRORS; -- *** No rows added.
SELECT etc_errorcode FROM et_dropme WHERE etc_errorcode <> 0; -- *** One row found. 2801
DROP ERROR TABLE FOR dropme;
DROP TABLE dropme;

-- MULTISET/NUPI/USI
CREATE MULTISET TABLE dropme(i INT, j INT) PRIMARY INDEX(i), UNIQUE INDEX(j);
INSERT INTO dropme VALUES(1,1); -- *** One row added.
INSERT INTO dropme VALUES(1,1); -- *** Failure 2803 Secondary index uniqueness violation
INSERT INTO dropme SELECT * FROM dropme; -- *** Failure 2803 Secondary index uniqueness violation
CREATE ERROR TABLE FOR dropme;
INSERT INTO dropme SELECT * FROM dropme LOGGING ALL ERRORS; -- *** Failure 9127 Index violations detected
SELECT etc_errorcode FROM et_dropme WHERE etc_errorcode <> 0; -- *** One row found. 2803
DROP ERROR TABLE FOR dropme;
DROP TABLE dropme;

-- MULTISET/UPI/USI
CREATE MULTISET TABLE dropme(i INT, j INT) UNIQUE PRIMARY INDEX(i), UNIQUE INDEX(j);
INSERT INTO dropme VALUES(1,1);
INSERT INTO dropme VALUES(1,1); -- *** Failure 2801 Duplicate unique prime key error
INSERT INTO dropme SELECT * FROM dropme; -- *** Failure 2801 Duplicate unique prime key error
CREATE ERROR TABLE FOR dropme;
INSERT INTO dropme SELECT * FROM dropme LOGGING ALL ERRORS; -- *** No rows added.
SELECT etc_errorcode FROM et_dropme WHERE etc_errorcode <> 0; -- *** One row found. 2801
DROP ERROR TABLE FOR dropme;
DROP TABLE dropme;

Dieter

jskarphol 6 comments Joined 07/10
10 Mar 2011

Thanks to bwb and dnoeth for their helpful comments. I made a minor clarification to this article about FastLoad or the TPT Load Operator being the utilities that will discard duplicate rows without any notification when they're used to insert into a SET table.

JimmyLee 5 comments Joined 06/10
16 Mar 2011

Nicely done.

"To not give your best is to sacrifice the gift" -- Steve Prefontaine

umakathir 5 comments Joined 09/06
26 Aug 2011

Nice explanation with examples. However, I have a question on SET/NUPI performance. I feel SET/MULTISET plays at a higher level than the index; do we still find any performance degradation due to SET/NUPI?

jana.teradata 2 comments Joined 02/10
08 Mar 2012

That is a very nice explanation dieter... you rock !!!!

itsmeabhi99 1 comment Joined 08/11
05 Apr 2012

With Dieter's wonderful explanation, one thing is clear: Teradata behaves differently when using INSERT INTO .... VALUES and INSERT INTO ... SELECT *...

Does anyone know why this differentiation exists?

umakathir 5 comments Joined 09/06
05 Apr 2012

I also found that there is a difference in identity column behavior.... Though we specified the start value and the increment, it behaves differently. When we do INSERT INTO ... VALUES, it starts with 1 and increments by 1, whereas when we do INSERT INTO ... SELECT, it starts at some 1000 or 10000 and increments based on the increment value specified... don't know if there is any DBC-level setting for that.

jmamedov 5 comments Joined 02/12
24 May 2012

Thank you jskarphol for article.
Is there any impact of choice of data type on selection of primary index?
Varchar, char or integer? Since hashed values will be stored what if any difference makes data type of primary index?
What do you think about adding surrogate keys as a column to all tables as ultimate candidate for primary index? I understand that users utilize natural keys for data selection but, those can be handled with secondary indexes.

22 Jul 2014

Thank you, very informative.
