hans-peter 7 posts Joined 11/04
31 Jan 2012
BTEQ Output: Unicode Column Wider Than Defined in DDL or FORMAT

My question is about the output of a UNICODE column in BTEQ.

If I select a column defined like this ...
   txt VARCHAR(11) CHARACTER SET UNICODE CASESPECIFIC,
... it is not 11 characters wide, but 33 characters wide
(3 times the number of characters defined for the VARCHAR).

The BTEQ session character set is UTF8: .set session charset 'UTF8';
I've already tried some CAST and FORMAT variations. For example, if I use cast(substr(txt,1,4) as char(4) format 'X(4)') c, the column still requires 12 character positions in the BTEQ output (4 characters multiplied by 3).
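For reference, the attempt looks like this (a minimal sketch; uc_test is the example table shown further down):

   select id,
          cast(substr(txt,1,4) as char(4) format 'X(4)') as c
   from uc_test;
   /* under a UTF8 session, column c is still printed 12 positions wide */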
If I use an ASCII session, the column is 11 characters wide, as expected.

In the Teradata documentation and in an orange book about "International Character Sets" I read that when a Unicode column is exported in the UTF8 character set, it is always 3 times the length. (Why?)

My Unicode column contains words in various European, especially Eastern European, languages, which do not require the space of 3 'characters' each, but I have to use a UTF8 session to display those Eastern European characters correctly (an ASCII session character set does not display them correctly).

So, at the moment, I see no possibility to format such a Unicode column to a desired length?!

Does anyone have an idea that could help? Thanks in advance.

ulrich 816 posts Joined 09/09
31 Jan 2012

Don't mix up the number of characters with the number of bytes needed to store those characters:

in Latin the ratio is 1:1

in UTF8 it is up to 1:3

So what is your problem with BTEQ?

Can you share the code?


hans-peter 7 posts Joined 11/04
31 Jan 2012

I agree that the number of bytes needed to STORE a character in UTF8 can be up to 3. But I don't understand why 3 character positions are needed to display it in BTEQ...

So the column 'txt' in the following example can store 11 characters, and the space required to store them is up to 33 bytes. But why is the space needed to display it in a BTEQ report 33 characters?

 

Here is the example:

In SQL Assistant (but that should make no difference, it could also be BTEQ):

DATABASE testdb;

CREATE SET TABLE uc_test, NO FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT
     (
      id INTEGER,
      txt VARCHAR(11) CHARACTER SET UNICODE CASESPECIFIC
     )
UNIQUE PRIMARY INDEX PI_uc_test (id);

INSERT INTO uc_test VALUES (1, 'Zeile1');

$ bteq

 Teradata BTEQ 12.00.00.09 for LINUX.
 Copyright 1984-2010, NCR Corporation. ALL RIGHTS RESERVED.
 Enter your logon or BTEQ command:
.set session transaction ansi;
.set session charset 'UTF8';
.logon TDUSER
Password:

 *** Logon successfully completed.
 *** Teradata Database Release is 12.00.03.03
 *** Teradata Database Version is 12.00.03.25a
 *** Transaction Semantics are ANSI.
 *** Character Set Name is 'UTF8'.

 BTEQ -- Enter your DBC/SQL request or BTEQ command:
database testdb;

 *** New default database accepted.

 BTEQ -- Enter your DBC/SQL request or BTEQ command:
commit work;

 *** COMMIT done.

 BTEQ -- Enter your DBC/SQL request or BTEQ command:
select id,txt from uc_test;

 *** Query completed. One row found. 2 columns returned.

         id  txt
-----------  ---------------------------------
          1  Zeile1

--> In this output, 33 character positions (a count of characters, not a length in bytes) are used for the field 'txt'. I would like to have 11 characters...

ulrich 816 posts Joined 09/09
01 Feb 2012

I checked it with some real Unicode values (and used only CHAR(2) in the table definition).

select * from uc_test;

 *** Query completed. 4 rows found. 2 columns returned.

         id  txt
-----------  ------
          1  ZZ
          4  शक
          3  ვე
          2  ₪Z

It is as you stated: 3 * <number of characters> seems to be reserved, but is not needed here.
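To see that the extra positions are padding rather than data, you can compare the character count with the stored bytes (a sketch; CHARACTERS and CHAR2HEXINT are standard Teradata functions, and CHAR2HEXINT shows the internal UTF-16 representation):

   select id,
          txt,
          characters(txt) as num_chars,     /* always 2 for this CHAR(2) column */
          char2hexint(txt) as internal_hex  /* 2 bytes per character internally */
   from uc_test
   order by id;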

TITLE also does not seem to help; the width seems to be determined by the maximum of the title length and the calculated column length.
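For example (a sketch), even a one-character title does not shrink the column:

   select id, txt (title 'T') from uc_test;
   /* txt is still printed 6 positions wide (2 characters * 3) */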

Maybe - but this is just a guess - this is related to the fact that BTEQ can also be used for exporting data, where the byte length becomes an issue.

 


hans-peter 7 posts Joined 11/04
01 Feb 2012

... I see, that might be the reason....

The problem occurred when we changed a text column from the Latin to the Unicode character set in the DDL. Afterwards, the export file of a BTEQ report, which contains several fixed-width character columns, no longer fit on a paper page, because the text columns are three times as wide as before.

I understand that if the Unicode column contains letters which require 3 bytes in UTF8 encoding, then exporting such a letter takes 3 bytes in an export produced by BTEQ in UTF8 encoding. To get a fixed column length for all possible cases, BTEQ uses the maximum width, which is 3 times the number of characters.

What I am searching for is a way to specify the character length in the BTEQ output, but I am afraid that there is no way to do that (using SQL/BTEQ)?!

 

ulrich 816 posts Joined 09/09
01 Feb 2012

Not sure that I understand your last question - you can check dbc.columns.

ColumnLength gives you the number of bytes needed to store the column, and CharType gives you the character type.

TD internally stores the Unicode data as UTF-16, which means characters * 2 = ColumnLength.

So for CharType = 2 you need to compute ColumnLength / 2 * 3 to get the width in BTEQ for UTF8.
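A sketch of that calculation against dbc.columns (assuming the uc_test table from above; CharType 1 = Latin, 2 = Unicode):

   select columnname,
          columnlength,                    /* internal storage bytes (UTF-16 for Unicode) */
          chartype,
          case when chartype = 2
               then columnlength / 2 * 3   /* characters * 3 = UTF8 width in BTEQ */
               else columnlength
          end as utf8_width
   from dbc.columns
   where databasename = 'testdb'
     and tablename = 'uc_test';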

Is this what you are looking for?


hans-peter 7 posts Joined 11/04
01 Feb 2012

I agree with your explanation; the max length can be computed like that using dbc.columns. But I am looking for a way to specify that the BTEQ output of a UNICODE column does not require 3 times the number of characters defined in the DDL - e.g. using FORMAT, CAST, another charset, <other ideas>...?

And after all, I still don't see why the export of a UNICODE column in the UTF8 character set is implemented in BTEQ the way it is. Why wouldn't it be correct to pad missing characters (if it's a VARCHAR) with spaces up to the maximum number of characters, rather than up to the maximum number of bytes used to encode the characters in UTF8? (In your example, the CHAR(2) field always contains 2 characters; they are exported correctly, and the Asian characters probably use 2 or 3 bytes each; in a browser/editor that can display these UNICODE characters, each of them occupies the space of one character; so why the 4 extra spaces?)

ulrich 816 posts Joined 09/09
01 Feb 2012

Can't say why it is the way it is - maybe someone from TD can add some info.

My personal bet would be that you won't find a solution in BTEQ as long as you have to stick with UTF8.


hans-peter 7 posts Joined 11/04
01 Feb 2012

Thanks so far! If someone else can add some info, feel free to do so...

ericsun2 44 posts Joined 06/10
09 Mar 2013

It is a pretty common yet often overlooked standard: when exporting/importing database columns to/from files (binary or flat) in UTF8 mode, a VARCHAR(**) column in TD (or VARCHAR(**) in Oracle) is always calculated as ** x 3 bytes, regardless of whether the DDL uses Latin or Unicode (Oracle DDL uses BYTE or CHAR). The same rule is used in OCI and SQL*Loader as well.
The funny thing is, if you switch from UTF8 mode to UTF16 mode in BTEQ/FEXP/TPT, the estimated column size will be ** x 2 bytes, which is smaller than ** x 3 bytes. This sounds unbelievable, but folks should check it out.
I bet folks have all run into the following error in BTEQ/ODBC/JDBC/TPT at some point: the rows on disk are definitely smaller than 64KB, but you just can't retrieve them without chopping off several columns! Chances are that if you change your session charset from UTF8 to UTF16, you may avoid this annoying error in some cases, because the estimate for VARCHAR/CHAR columns in UTF16 mode is 33% smaller than in UTF8 mode:
[Error 3577] [SQLState HY000] Row size or Sort Key size overflow.
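A minimal BTEQ sketch of that switch (the session charset must be set before logon; note that UTF16 mode also changes the encoding of any exported file):

.set session charset 'UTF16';
.logon TDUSER
select id, txt from uc_test;
.logoff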
The reality is that we need Unicode support for multiple languages, but only NAME, ADDRESS, and DESCRIPTION types of columns need Unicode storage; the rest of the VARCHAR/CHAR columns, such as CODE or ID, are happy with Latin. A feature request to the Teradata team: when it comes to estimating the row size for client/server communication parcels (BTEQ, FEXP, TPT, ODBC, JDBC), it would be great

  • in UTF8 mode: to use ** X 1 for Latin, to use ** X 3 for Unicode
  • in UTF16 mode: to use ** X 1 for Latin, to use ** X 2 for Unicode

jnevill 17 posts Joined 03/12
22 Aug 2014

This thread is stale, but I've found no reasonable solution to the issue where BTEQ outputs 3x the number of characters necessary when the session's charset is set to 'UTF8'. I understand fully that UTF8 can require 3x the number of bytes to store a character, but it seems odd that BTEQ translates 3x the bytes into 3x the characters in its output.
Just the same, if you are executing BTEQ from a bash script for export, you can use a simple sed command to remove repeated spaces in a file. This will remove ALL repeated spaces, regardless of whether they are quote-encapsulated or not. If you wish to get fancier, you can run a similar regex through awk.

sed 's/^ *//;s/ *$//;s/ \{1,\}/ /g' inputfile > outputfile

 

david.craig 73 posts Joined 05/13
22 Aug 2014

On the server you can override the default export width at the session level. For example, you could set the Unicode-to-UTF8 width to 2 bytes per character if your worst-case UTF8 export is limited to Latin-based scripts (i.e., U+07FF and below). There is information on export width in the reference manuals.
Also, the server does not import or export the byte-order mark (BOM). For more info see: http://developer.teradata.com/tools/articles/whats-a-bom-and-why-do-i-care
