ericsun2 44 posts Joined 06/10
19 Feb 2014
CHAR(n) CHAR SET UNICODE/LATIN in the 14.xx DEFINE SCHEMA - multiply by 2 or 3?

In a TPT 13.10 script that uses "USING CHARACTER SET UTF8", we multiply the character length of a CHAR or VARCHAR column by 3, so for example:

  • CHAR(2) in DDL = CHAR(6) in TPT
  • VARCHAR(20) in DDL = VARCHAR(60) in TPT

It seems that in TPT 14.00 and 14.10, when we use UTF8, the following definitions work:

  • CHAR(2) CHAR SET UNICODE in DDL = CHAR(6) in TPT
  • CHAR(2) CHAR SET LATIN in DDL = CHAR(4) in TPT
  • VARCHAR(20) CHAR SET UNICODE in DDL = VARCHAR(60) in TPT
  • VARCHAR(20) CHAR SET LATIN in DDL = VARCHAR(60) in TPT

The DDL below only works with the TPT schema definition that follows it:

 CREATE SET TABLE tpt_data_type ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
      PRODUCT_ID INTEGER NOT NULL,
      ORDER_ID BIGINT NOT NULL,
      ORDER_DESC VARCHAR(100) CHARACTER SET UNICODE NOT CASESPECIFIC,
      STORE_CODE VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC,
      ACTIVE_FLAG CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC,
      UNICODE_FLAG CHAR(1) CHARACTER SET UNICODE NOT CASESPECIFIC,
      ORDER_AMT DECIMAL(15,2),
      ORDER_TIME TIMESTAMP(0),
      ORDER_STATE CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC,
      UPDATED_TS TIMESTAMP(0))
PRIMARY INDEX ( ORDER_ID );

USING CHARACTER SET UTF8
DEFINE JOB EXPORT_TO_FASTLOAD_FORMAT
DESCRIPTION 'Export from ' || @SourceTableName || ' to the INDICDATA file: ' || @DataFileName
(
DEFINE SCHEMA DATA_FILE_SCHEMA
(
"PRODUCT_ID" Int,
"ORDER_ID" BigInt,
"ORDER_DESC" Varchar(300),	/* 100 x 3 */
"STORE_CODE" Varchar(60),	/* 20 x 3 */
"ACTIVE_FLAG" Char(2),		/* 1 x 2 */
"UNICODE_FLAG" Char(3),         /* 1 x 3 */
"ORDER_AMT" Decimal(15,2),
"ORDER_TIME" Timestamp(0),
"ORDER_STATE" Char(2),		/* 1 x 2 */
"UPDATED_TS" Timestamp(0)
);

  DEFINE OPERATOR EXPORT_OPERATOR
  TYPE EXPORT
  SCHEMA DATA_FILE_SCHEMA
  ATTRIBUTES
  (
  VARCHAR PrivateLogName    = @SourceTableName || '_log',
  VARCHAR TdpId             = @TdpId,
  VARCHAR UserName          = @UserName,
  VARCHAR UserPassword      = @UserPassword,
  VARCHAR QueryBandSessInfo = 'Action=TPT_EXPORT; Format=Fastload;',
  VARCHAR SpoolMode         = 'noSpool',
  INTEGER MaxDecimalDigits  = 18,
  VARCHAR DateForm          = 'INTEGERDATE',
  VARCHAR SelectStmt        = 'select * from ' || @SourceTableName  
  );

  DEFINE OPERATOR FILE_WRITER
  TYPE DATACONNECTOR CONSUMER
  SCHEMA *
  ATTRIBUTES
  (
  VARCHAR PrivateLogName   = 'indicdata_writer_log',
  VARCHAR DirectoryPath    = @DataFilePath,
  VARCHAR FileName         = @DataFileName,
  VARCHAR Format           = 'Formatted',
  VARCHAR OpenMode         = 'Write',
  VARCHAR IndicatorMode    = 'Y'
  );
  
  APPLY TO OPERATOR (FILE_WRITER[@DataFileCount])
  SELECT * FROM OPERATOR (EXPORT_OPERATOR[@NumOfReader]);
);
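For anyone reproducing this, a hypothetical invocation (the script file name, job variable values, and credentials below are placeholders, not taken from the original job):

  $ tbuild -f export_job.tpt -v jobvars.txt EXPORT_TO_FASTLOAD_FORMAT

where jobvars.txt supplies the @-variables referenced in the script, e.g.:

  SourceTableName = 'tpt_data_type',
  DataFilePath    = '/var/tmp',
  DataFileName    = 'tpt_data_type.fastload',
  TdpId           = 'mydbs',
  UserName        = 'myuser',
  UserPassword    = 'mypassword',
  DataFileCount   = 1,
  NumOfReader     = 1

The exported file then dumps like this: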
$ hexdump -C /var/tmp/tpt_data_type.fastload 
00000000  64 00 00 00 12 27 00 00  15 cd 5b 07 00 00 00 00  |d....'...�[.....|
00000010  19 00 44 55 4d 4d 59 20  44 45 53 43 20 e6 95 99  |..DUMMY DESC �..|
00000020  e5 ad a6 e8 ae be e8 ae  a1 20 21 04 00 53 45 41  |学设计 !..SEA|
00000030  32 59 20 e5 85 a8 bf 09  00 00 00 00 00 00 32 30  |2Y �.��.......20|
00000040  31 34 2d 30 32 2d 31 32  20 31 32 3a 31 32 3a 32  |14-02-12 12:12:2|
00000050  32 50 20 32 30 31 34 2d  30 32 2d 31 39 20 31 37  |2P 2014-02-19 17|
00000060  3a 32 37 3a 35 37 0a 64  00 00 00 19 27 00 00 45  |:27:57.d....'..E|
00000070  ef 54 07 00 00 00 00 19  00 44 55 4d 4d 59 20 44  |�T.......DUMMY D|
00000080  45 53 43 20 e5 af bb e6  89 be e9 87 91 e6 98 9f  |ESC 寻�.��..�..|
00000090  20 21 04 00 4c 41 58 33  4e 20 e5 85 8d 83 13 00  | !..LAX3N �.....|
000000a0  00 00 00 00 00 32 30 31  34 2d 30 31 2d 31 31 20  |.....2014-01-11 |
000000b0  31 31 3a 31 31 3a 31 31  57 20 32 30 31 34 2d 30  |11:11:11W 2014-0|
000000c0  32 2d 31 39 20 31 37 3a  32 37 3a 35 37 0a        |2-19 17:27:57.|
000000ce

CHAR(n) CHAR SET LATIN behaves differently than in the previous version. Can someone please confirm?
 

ericsun2 44 posts Joined 06/10
20 Feb 2014

According to 2436.pdf (the Teradata PT Reference):

Use of UNICODE Affects Column Width Requirements
When specifying a UTF8 character set in a Teradata PT script, the output SCHEMA definition must define column widths three times larger.
When using the UTF16 character set in a Teradata PT job script, the output SCHEMA definition must define column widths two times larger. The width values must be even, positive numbers.

I don't see anything special called out for CHAR(n).
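
For cross-checking, here is a hypothetical dictionary query that reproduces the multipliers observed in this thread (LATIN = chars x 2, UNICODE = chars x 3). It assumes DBC.ColumnsV reports ColumnLength in bytes, so UNICODE columns already count 2 bytes per character; treat it as a sketch to verify against DEFINE SCHEMA ... FROM TABLE, not as an official rule:

  SELECT ColumnName,
         ColumnType,                                /* 'CF' = CHAR, 'CV' = VARCHAR */
         CharType,                                  /* 1 = LATIN, 2 = UNICODE */
         CASE CharType
              WHEN 1 THEN ColumnLength * 2          /* LATIN:   n x 2 */
              WHEN 2 THEN (ColumnLength / 2) * 3    /* UNICODE: n x 3 */
         END AS tpt_utf8_width
  FROM   DBC.ColumnsV
  WHERE  TableName  = 'tpt_data_type'
  AND    ColumnType IN ('CF','CV')
  ORDER  BY ColumnId;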

ericsun2 44 posts Joined 06/10
20 Feb 2014

I double-checked the above observation against the auto-generated schema, and it confirms that "CHAR(n) CHAR SET LATIN" is estimated as "CHAR(n x 2)" in the TPT definition when exporting or loading a UTF8 data file.
DEFINE SCHEMA DATA_FILE_SCHEMA FROM TABLE 'tpt_data_type';

     DataConnector operator for Linux release 2.6.32-358.6.2.el6.x86_64 on eat1-etl07.corp
     TDICU................................... 14.10.00.00
     PXICU................................... 14.10.00.03
     PMPROCS................................. 14.10.00.07
     PMRWFMT................................. 14.10.00.03
     PMTRCE.................................. 13.00.00.02
     PMMM.................................... 03.00.00.01
     DCUDDI.................................. 14.10.00.43
     PMHEXDMP................................ 14.10.00.01
     PMUNXDSK................................ 14.10.00.10
      
     UseGeneralUDDIcase: 'NO (defaulted)' (=0)
     WriteBOM: 'NO (defaulted)' (=0)
     AcceptExcessColumns: 'NO (defaulted)' (=0)
     AcceptMissingColumns: 'NO (defaulted)' (=0)
     TruncateColumnData: 'NO (defaulted)' (=0)
     TruncateColumns: 'NO (defaulted)' (=0)
     TruncateLongCols: 'NO (defaulted)' (=0)
     WARNING!  RecordErrorFilePrefix attribute not specified, there is no default
     RecordErrorVerbosity: OFF (default) (=0)
     FileName: 'tpt_data_type.fastload'
     OpenMode: 'WRITE' (2)
     Format: 'FORMATTED' (3)
     IOBufferSize: 131072
      
     Full File Path: /var/tmp/tpt_data_type.fastload
     Data Type              Ind  Length  Offset M
              INTEGER (  1)   1       4       0 N
               BIGINT ( 37)   1       8       4 N
              VARCHAR (  7)   1     300      12 N
              VARCHAR (  7)   1      40     312 N
                 CHAR (  5)   1       2     352 N
                 CHAR (  5)   1       3     354 N
              DECIMAL (  4)   1       8     357 N
                 CHAR (  5)   1      19     365 N
                 CHAR (  5)   1       2     384 N
                 CHAR (  5)   1      19     386 N
     Schema is not all character data
     Schema is compatible with delimited data

     ===================================================================
     =                                                                 =
     =                     Column/Field Definition                     =
     =                                                                 =
     ===================================================================

     Column Name                    Offset Length Type      
     ============================== ====== ====== ========================
     "PRODUCT_ID"                        0      4 INTEGER
     "ORDER_ID"                          4      8 BIGINT
     "ORDER_DESC"                       12    300 VARCHAR
     "STORE_CODE"                      314     40 VARCHAR
     "ACTIVE_FLAG"                     356      2 CHAR
     "UNICODE_FLAG"                    358      3 CHAR
     "ORDER_AMT"                       361      8 DECIMAL(15,2)
     "ORDER_TIME"                      369     19 CHAR
     "ORDER_STATE"                     388      2 CHAR
     "UPDATED_TS"                      390     19 CHAR
     ============================== ====== ====== ========================
     INDICATOR BYTES NEEDED: 2
     EXPECTED RECORD LENGTH: 411
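
Transcribed from the log above, the equivalent hand-written schema would be the following; note that STORE_CODE comes out as 20 x 2 = 40 for LATIN, not the 20 x 3 = 60 used in the first post:

  DEFINE SCHEMA DATA_FILE_SCHEMA
  (
  "PRODUCT_ID"   Integer,
  "ORDER_ID"     BigInt,
  "ORDER_DESC"   Varchar(300),  /* 100 x 3, UNICODE */
  "STORE_CODE"   Varchar(40),   /* 20 x 2, LATIN */
  "ACTIVE_FLAG"  Char(2),       /* 1 x 2, LATIN */
  "UNICODE_FLAG" Char(3),       /* 1 x 3, UNICODE */
  "ORDER_AMT"    Decimal(15,2),
  "ORDER_TIME"   Timestamp(0),
  "ORDER_STATE"  Char(2),       /* 1 x 2, LATIN */
  "UPDATED_TS"   Timestamp(0)
  );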

 

eejimkos 73 posts Joined 01/12
30 May 2014

Hello,
Can I ask something about this?
Does this mean that during the process TD transfers two or three times more data than the actual space needed (if we had only VARCHAR columns)?
 
Thank you.
 

feinholz 1234 posts Joined 05/08
30 May 2014

TPT does not know anything about the table definition of the source table, whether the columns are of type LATIN or UNICODE.
 
We only work off of the client session character set.
Database column definitions are in terms of characters.
TPT schema is in terms of bytes.
Any character in the UTF8 client session character set can be 1, 2, or 3 bytes.
And TPT does not know which.
Thus, we will document that the schema definition must be 3x the size of the table definition so that enough space is reserved for the data.
 
The actual bytes exported from Teradata are controlled by several factors, among them the column definition and the database export width.
 
(And this is just CHAR/VARCHAR. It gets worse with the TIMESTAMP data types.)
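 
Concretely: in UTF-8, 'A' is 1 byte, 'é' is 2 bytes, and '学' is 3 bytes. The 教学设计 in the hexdump above occupies 12 bytes for 4 characters, which is why a CHAR(n) UNICODE column needs up to 3 x n bytes reserved in the schema.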
 

--SteveF

eejimkos 73 posts Joined 01/12
30 May 2014

Thanks very much for your answer.
To be honest I have not worked a lot with TD loading tools, but I am trying to understand one thing about them: network traffic.
If, for example, I have one table with two columns
clm1 char(10) latin
clm2 varchar(10) unicode
and I want to insert 1M rows, no NULLs.
 
From this example, I need at least
clm1: 1M * 10 * 2 bytes
clm2: 1M * 10 * 2 bytes
 
I meant earlier that a defined DDL must have a corresponding schema in TPT.
My question is the following: if my real data are not 10 characters long, does TPT send the full defined size, or does it send just the characters it finds?
Would it be better, if I know that my data are at most 5 characters in every column, to change the defined length to (5) both in the table and in the TPT script, in order to decrease the time spent sending the data, or even to use fewer resources on the client side to create the data that must be transferred to TD?
 
Thank you.

feinholz 1234 posts Joined 05/08
30 May 2014

If you are uncertain about your data, you are better off using VARCHAR instead of CHAR.
Less traffic across the network (as you pointed out). We only send what we receive.
As far as resources to create the data, there is no difference.
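 
To make that concrete with the earlier example (a back-of-envelope sketch, assuming 5-character LATIN values in a UTF8 session, where each such character is 1 byte on the wire): a CHAR(10) field always occupies its full schema width, 10 x 2 = 20 bytes per row under the x 2 LATIN rule observed above, so 1M rows send roughly 20 MB for that column whether the data fills it or not. A VARCHAR(10) field sends a 2-byte length prefix plus only the actual data, roughly 1M x (2 + 5) bytes, about 7 MB. The hexdump in the first post shows this layout: each VARCHAR value is preceded by a 2-byte little-endian length (19 00 = 25 bytes), while CHAR fields are space-padded to their full width.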
 

--SteveF
