
Working with Identity Columns and Unity Director and Loader
One of Unity Director and Loader’s core benefits is the ability to keep multiple Teradata systems synchronized with online, transactional-consistent, changes. Unlike post-transactional replication, Unity Director’s SQL-multicast sends requests in parallel to all connected systems at the same time. It also performs consistent checking on the responses of every write statement to ensure changes both systems succeed and produce the same results.
While this works great for normal data, Identity columns pose a special problem. Identity columns in a teradata database are auto-generated numbers inside the Teradata database that are typically used as surrogate keys.
Consider the following employee table, which uses a default IDENTITY column for an id:
CREATE SET TABLE tbl_emp ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT, DEFAULT MERGEBLOCKRATIO ( id INTEGER GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1 MINVALUE 0 MAXVALUE 1000000 NO CYCLE), Name VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC, Phone VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC) PRIMARY INDEX ( id );
Since Unity Director/Loader has no was to synchronize the current value of the IDENTITY column across multiple systems and it can’t see the value provided inside the database for the ID column, it can’t guarantee the consistency of the data. For this reason Unity Director and Loader disallow writes to table using IDENTITY columns:
BTEQ -- Enter your SQL request or BTEQ command: insert into tbl_emp (Name,Phone) values ('Paul','1234567890'); insert into tbl_emp (Name,Phone) values ('Paul','1234567890'); *** Failure 4514 Query is disallowed because it writes to a table containing an identity column. *** Total elapsed time was 1 second.
This can pose a challenge for anyone that wants to adopt Unity Director and Loader to do data synchronization but also needs to write to tables that require IDENTITY columns. Let’s look at some of the ways to overcome this challenge.
Do you need to write to this table via Unity Director or Loader?
Just because you have a table with an identity column in it, does not automatically mean you have a problem. If the table is relatively static and unchanging or changes at scheduled times, it might be easier to use a Load-Copy approach by loading the table directly on one system, and then replicating it to other systems using DataMover. Unity Ecosystem manager can be used to coordinate this approach.
Consider removing the Identity column
The first option considered should always to be eliminate the IDENTITY column altogether. Whether this is possible depends on why a surrogate key was chosen in the first place.
- In many situations, IDENTITY columns have been used to implement multi-system replication methods using change data capture – methods that may now be irrelevant in a Unity Director/Loader environment, since Director and Loader don’t need to rely on change data capture.
- Sometimes surrogate keys are chosen in order to provide better distribution than the natural key of the table would provide. There are, however, other alternative techniques, like adding a timestamp column to the PI to address issues of skew that can be used in place of using a surrogate key.
- Occasionally, surrogate keys are used simply to achieve a clean snow-flake design. In these cases, it’s becomes an engineering question of balancing the new trade-offs of using IDENTITY columns (not writing to them via Unity Director/Loader) versus the benefits (conformance to a particular design ideal).
Consider moving the surrogate key function to the client
Note that while IDENTITY columns do generate sequential numbers, this is actually not a requirement for surrogate keys. Surrogate keys require unique numbers, not necessarily sequential numbers. For a variety of reasons, sequential numbers can be more work for databases to deal with. For a vast majority of cases, substituting a client generated value for a surrogate key is an ideal solution. Any time the client application is in-house or custom build, this should be considered. Almost all programming languages have UUID (universally unique id) functions built in. Most ETL applications (E.g. Abinitio, DataStage, Informatica) also have built-in functions to generate surrogate keys for tables.
Using a client side function eliminates any compatibility issue with Unity Director/Loader, and in many cases will allow Director/Loader to make better locking choices since it can now see the value of the surrogate key that is used in the Primary Index of a table, which will improve concurrency in some situations.
Here’s an example in java of a prepared statement using Java’s UUID function to generate a surrogate key. In this example, the UUID is stored as a CHAR(36) in the database, instead of an INTEGER:
drop table tbl_emp_client_uuid; CREATE TABLE tbl_emp_client_uuid (id char(36), Name VARCHAR(20), Phone VARCHAR(10)) PRIMARY INDEX (NAME,PHONE); // INSERT statement String insertTable = "INSERT INTO " + tableName + " (id,NAME,Phone) VALUES (?,?,?)"; PreparedStatement pstmt=null; try { pstmt = con.prepareStatement(insertTable); pstmt.setString(1, UUID.randomUUID().toString()); pstmt.setString(2, name); pstmt.setString(3, phone); pstmt.execute(); } catch (SQLException ex) { handleError(ex); }
The result is a good surrogate key that will distribute well, and is globally unique:
select * from tbl_emp_client_uuid; *** Query completed. 1 rows found. 3 columns returned. *** Total elapsed time was 1 second. id Name Phone ------------------------------------ -------------------- ---------- 00694654-ba20-4bf6-b5df-cecda9b815ad Paul 1234567890
Implement your own sequential number key in the database
If you do require sequential numbers for surrogate keys, it is possible to implement a solution that uses the database to provide sequential numbers. There are several possible ways to implement this, but the most common is to use a key table to provide values for several different sequences:
CREATE SET TABLE SEQUENCES ( SEQ_Name VARCHAR(30), Next_Value DECIMAL(18,0) TITLE 'Next sequence value' ) UNIQUE PRIMARY INDEX (SEQ_Name);
New sequences are defined by insert rows into this table. The Next_Value column determines the next value taken by a client application.
insert into sequences values ('emp_id',1); *** Insert completed. One row added. *** Total elapsed time was 1 second.
To use these sequences, client applications must allocate blocks of sequences values. A simple stored procedure might be used encapsulates the request to do this:
REPLACE PROCEDURE getSequenceValues ( IN Name VARCHAR(30), IN neededValues DECIMAL(18,0), OUT nextValue DECIMAL(18,0) ) BEGIN BEGIN TRANSACTION; LOCKING ROW FOR WRITE SELECT Next_Value into nextValue FROM sequences WHERE SEQ_NAME = Name; UPDATE sequences SET Next_Value = Next_Value + neededValues WHERE SEQ_NAME = Name; END TRANSACTION; END;
…or this could be left as a client-side function. In this example, a client could allocate a set of 10 values using the stored procedure:
call getSequenceValues('emp_id',10,NextVal); *** Procedure has been executed. *** Total elapsed time was 1 second. nextValue -------------------- 31.
… and then use them by setting the value of the id column (in the same way it set the ID column to the UUID value above, only with an integer) and increment the id as it consumes the 10 values it allocated. Note that this is still, in a slightly different form, using a client-side solution to generate surrogate keys. The only difference is it relies on the value in a table inside the database to ensure uniqueness.
It’s also important to note that allocating a single value at a time when requiring many values could lead to scalability problems since Unity Director and Loader will acquire a table-level lock on the sequences table when the getSequenceValue stored procedure is called. This stored procedure call should be very fast, since it’s only doing a single-row update, but with high concurrency, frequent calls could cause serialization across all the processes that use it for surrogate keys.
Build your own internal database UUID
All of the above techniques require some application changes to accommodate them. As a last resort, there is one technique that doesn’t require any client changes, but does come with a performance cost. It is possible to implement your own UUID function inside the database that is deterministic across Teradata systems, and to define a trigger to set this value as a surrogate key.
This function uses the RANDOM and CURRENT_TIMESTAMP to create a unique id for a surrogate key, that will also guarantee deterministic values when executed on multiple Teradata systems (above version 14.00.02) when used with Unity Director and Loader. This function relies on the fact that for connections that go through Unity Director and Loader, Director supplies the current time and random seed to the internal database functions.
REPLACE FUNCTION syslib.UUID() RETURNS BIGINT LANGUAGE SQL CONTAINS SQL DETERMINISTIC CALLED ON NULL INPUT SQL SECURITY DEFINER COLLATION INVOKER INLINE TYPE 1 RETURN CAST( TRIM(CAST(Random(-92233703,922337203) AS CHAR(9) ))|| (CURRENT_TIMESTAMP(FORMAT 'MMDDHHMISS') (CHAR(10))) AS BIGINT); grant execute function on syslib.uuid to public;
Since this function returns a BIGINT instead of an INTEGER, a minor change is required to the employee table:
CREATE TABLE tbl_emp_new (id BIGINT, Name VARCHAR(20), Phone VARCHAR(10)) PRIMARY INDEX (NAME,PHONE);
To set the value of the id, a trigger is used that emulates a default value:
REPLACE TRIGGER etltest.tbl_emp_uuid BEFORE INSERT ON etltest.tbl_emp_new REFERENCING NEW AS NewRow FOR EACH ROW when (NewRow.id is null) ( set NewRow.id=syslib.UUID() ; );
The advantage of this approach is that no further client changes are required. The table now functions much like it did when it used the IDENTITY column:
*** Query completed. One row found. 3 columns returned. *** Total elapsed time was 1 second. id Name Phone -------------------- -------------------- ---------- 1899301991121182541 Paul 1234567890
There is however a high performance cost to this convenience. The following table shows a comparison of the time used to insert a varying number of rows into the tbl_emp table using several different methods.
- The first two scenarios show the baseline time direct to a single Teradata system (bypassing Unity Director) and to the table without any value for the ID column. Any client side function used to provide a surrogate key would provide similar performance.
- The second scenario shows the time to insert the same rows using the sequence method to allocate a set of sequencer values for the surrogate key. In this test, values were allocated in batches of 100. Since the overhead in this method comes from the extra stored procedure call, increasing the batch size would improve the efficiency of this method, while decreasing it would worsen the efficency.
- The last scenario shows the method using the custom deterministic UUID function and trigger to set the ID value in the table. Note that this method is 6-7 times as slow as the baseline for one session and ~40 times as slow for 10 sessions.
Reps: |
10 |
Baseline (TD direct with Identity col) |
Baseline (via Director, no ID column) |
Using Sequence Method |
Using UUID-trigger method |
||||||||
Rows: |
5000 |
Min |
Avg |
Max |
Min |
Avg |
Max |
Min |
Avg |
Max |
Min |
Avg |
Max |
Sessions |
Total Rows |
Sec |
Sec |
Sec |
Sec |
Sec |
Sec |
Sec |
Sec |
Sec |
Sec |
Sec |
Sec |
1 |
50000 |
9 |
9.7 |
10 |
10 |
10.5 |
12 |
12 |
12.6 |
14 |
66 |
66.5 |
67 |
5 |
250000 |
11 |
15.26 |
19 |
11 |
12.78 |
16 |
56 |
70.74 |
91 |
326 |
340.45 |
357 |
10 |
500000 |
15 |
18.34 |
27 |
15 |
15.95 |
23 |
124 |
143.25 |
168 |
725 |
743.1 |
768 |
Selecting the best technique for you
An ounce of prevention is worth a pound of cure; if you are considering Unity Director or Loader for your multisystem environment at some point in the further, a best practice is to avoid implementing IDENTITY columns now. IDENTITY columns can pose problems for any active-active multisystem environment, regardless of the synchronzation techique used.
If the situation arises that you need to find an alternative to replace IDENTITY columns in order to implement Unity Loader, you should assess each of these methods based on the specific requirements for each of your tables. You might use a single technique for all your tables or several different approaches for different sets of tables. With careful consideration, the challenge of IDENTITY columns in a Unity Director/Loader environment can be easily overcome.
Paul,
the syslib.UUID() is clearly not usable as a UUID generator. The number of synonyms is fare to high.
With
select id, count(0) from ( select calendar_date, syslib.UUID() as id from sys_calendar.calendar ) as t group by 1 having count(*) > 1 order by 1;
I got already 2 synonyms - for only <80.000 values.
<800.000 rows I got already 279 synonyms...
select id, count(0) from ( select c.calendar_date, syslib.UUID() as id from sys_calendar.calendar c cross join sys_calendar.calendar j where j.day_of_calendar between 1 and 10 ) as t group by 1 having count(*) > 1 order by 1;
Your result might vary but running this some times should show the issue.
Reason is that the number range of Random(-92233703,922337203)
is fare to small! There exists some good formulas to calculate the probability of collisions.
SHA256 might be usable for this approach.
But as UUIDs are getting more and more common TD might consider implementing this as a core function - UDFs will always consume too much CPU...
Regards Ulrich
feel free to donate bitcoin:12kgAUHFUqvG2sQgaRBXFhCwyf9HXdkGud