Workstation BTEQ recently added Unicode support. This article explains how to start a Unicode BTEQ session in both interactive and batch mode. Command line options give you the control and flexibility to run BTEQ in various Unicode environments while preserving BTEQ’s legacy behavior.

The following information is intended for users who are familiar with Unicode and have used BTEQ in the past. For more information on Unicode, please see the Unicode Consortium website (www.unicode.org). For more information on BTEQ, please read the Basic Teradata Query Reference Manual (Publication B035-2414).

Unicode Encodings Supported

BTEQ supports the UTF-8 and UTF-16 encoding forms of Unicode. Only the characters within the Basic Multilingual Plane (BMP) are supported by both BTEQ and the Teradata Database. Starting with the 13.10.0.0 release, BTEQ will begin supporting supplementary characters, but only for BTEQ commands, not SQL statements.

I/O Encoding vs. Session Character Set Encoding

Workstation BTEQ provides two encoding controls for supporting Unicode. The first control (I/O Encoding) handles the user interface for BTEQ. On input, that includes stdin and RUN files. On output, it includes stdout, stderr, and MESSAGEOUT files.

The second control (Session Character Set Encoding) handles the communication to and from the Teradata Database and the encoding of certain files: import files, export files, and source files for SQL (internal) stored procedures. Other external source files (for user-defined functions, user-defined methods, and external stored procedures) are an exception and must always be in the native encoding (ASCII for workstation clients or EBCDIC for mainframe clients).

Having two encoding values allows you to customize the BTEQ execution environment. All files and communication can use the same encoding, or you can define UTF-8 for the I/O encoding and UTF-16 for the session character set encoding, or vice versa.

The diagram below shows how the two encoding controls affect BTEQ.
Two encoding controls that BTEQ uses

All Unicode input and output files may optionally contain a Unicode Byte Order Mark (BOM), with the exception of stdout/stderr. The endianness of a UTF-16 file (including its BOM) must match the architecture of the machine on which BTEQ is running. UTF-8 files are not byte order dependent, so Workstation BTEQ can handle them on any supported platform.
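
The byte order rule can be checked before a file is handed to BTEQ. The following is a minimal Python sketch, not part of BTEQ; the function name utf16_bom_matches_host is hypothetical. It inspects a leading UTF-16 BOM and compares it with the byte order of the machine running the script, which is assumed to be the same machine that will run BTEQ.

```python
import codecs
import sys
from typing import Optional


def utf16_bom_matches_host(data: bytes) -> Optional[bool]:
    """Return True/False if a UTF-16 BOM matches the host byte order.

    Returns None when the data does not start with a UTF-16 BOM.
    (Hypothetical helper, for illustration only.)
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return sys.byteorder == "little"
    if data.startswith(codecs.BOM_UTF16_BE):
        return sys.byteorder == "big"
    return None


# Python's "utf-16" codec writes a BOM followed by text in the host byte
# order, so a script encoded this way matches the machine it was created on.
script_bytes = ".SHOW VERSIONS\n".encode("utf-16")
print(utf16_bom_matches_host(script_bytes))
```

A file created on a machine with the opposite byte order would return False here and would need to be converted before use.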

BTEQ Command Line Options

When you invoke BTEQ, a session is started, and when you execute a QUIT or EXIT command, the session is terminated. If the I/O encoding and session character set encoding are the same, then the BTEQ session is defined by that single encoding. By default Workstation BTEQ creates an ASCII session, but an ASCII session cannot handle Unicode interaction. To create a Unicode session, start BTEQ with the following command line options.

-c option

This option defines the session character set encoding for a Unicode session and takes an argument which can be any supported character set value. See the International Character Set Support manual (Publication B035-1125) for a list of supported values. Using this option would be analogous to submitting the SESSION CHARSET command at BTEQ startup time. An alternative to using the -c option is to define the session character set using the "charset_id" entry in the clispb.dat file (see the Teradata Call-Level Interface Version 2 Reference for Network-Attached Systems). If both the -c option and clispb.dat file are used for defining the session character set value, the -c option has precedence.

The -c option is available starting with BTEQ 8.2.3.0.

Note: The I/O encoding will default to the session character set value if neither the -e nor the -m option is used (see below).
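
The precedence rules above can be summarized in a short Python sketch. It illustrates the behavior described in this article and is not BTEQ's implementation. The function and parameter names are hypothetical, "LOCALE" is only a placeholder label for the locale-based I/O selected by -m, and the automatic detection of redirected Unicode input is not modeled.

```python
from typing import Optional, Tuple


def effective_encodings(
    c_option: Optional[str] = None,        # value passed with -c, if any
    clispb_charset: Optional[str] = None,  # "charset_id" from clispb.dat, if any
    e_option: Optional[str] = None,        # value passed with -e, if any
    m_option: bool = False,                # True when -m is given
) -> Tuple[str, str]:
    """Return (session_charset, io_encoding) per the rules in this article."""
    # -c takes precedence over clispb.dat; with neither, the default is ASCII.
    session = (c_option or clispb_charset or "ASCII").upper()
    if e_option:
        # ASSUMPTION: -e is checked before -m only for simplicity; the
        # article does not say what happens when both are given.
        io = e_option.upper()
    elif m_option:
        io = "LOCALE"  # placeholder: multi-byte characters based on the locale
    else:
        io = session   # neither -e nor -m: I/O follows the session charset
    return session, io


print(effective_encodings(c_option="utf16"))
print(effective_encodings(clispb_charset="utf8", e_option="utf16"))
```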

-e option

This option defines the I/O encoding for a Unicode session and takes a character set argument. The only possible values are:
• UTF8
• UTF16

This option is only valid when you define the session character set as 'UTF8' or 'UTF16' with the -c option or within the clispb.dat file. It allows you to communicate with BTEQ in one Unicode encoding while allowing communication to the Teradata Database in another Unicode encoding. For example:
bteq -e utf8 -c utf16 < input.txt > output.txt 2>&1

In this case, input.txt and output.txt will be encoded in UTF-8. However, the communication to/from the Teradata Database is in UTF-16.

The -e option is available starting with BTEQ 8.2.3.0.

Use this option only for executing BTEQ in batch mode when you want the I/O encoding and session character set encoding to be different Unicode transformation formats. For interactive mode, use the -m option (see below) or BTEQWIN.

-m option

The -m option indicates that I/O for an interactive Unicode session will be encoded in multi-byte characters based on the system locale, which is fully configurable by the user (as discussed in the next section). This allows non-Latin characters to be read and written properly in interactive mode. This option does not take any arguments. Like the -e option, it is only valid when you define the session character set as 'UTF8' or 'UTF16' with the -c option or within the clispb.dat file.

When the I/O encoding is controlled by the -m option, RUN files must be encoded in multi-byte characters based on the locale, not UTF-8 or UTF-16. Likewise, Workstation BTEQ will write multi-byte characters to MESSAGEOUT files. Be aware that BTEQ is limited to the characters that the current locale can handle. For example, Thai characters may not be read or displayed properly when a Japanese locale is used (see the screen example below). Characters outside the system's locale will most likely be displayed as a question mark ('?').

Unsupported characters are displayed as question marks
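
The same limitation can be reproduced outside BTEQ. This Python sketch is only an illustration. It assumes a Japanese locale whose code page is cp932 (a common choice on Japanese Windows systems) and shows that text outside that code page cannot be represented. Python substitutes '?' in this case, which resembles the screen example, although BTEQ's own conversion path may differ.

```python
def render_in_code_page(text: str, code_page: str = "cp932") -> str:
    """Round-trip text through a code page, replacing unsupported characters.

    Characters that the code page cannot represent come back as '?'.
    (Hypothetical helper, for illustration only.)
    """
    return text.encode(code_page, errors="replace").decode(code_page)


japanese = render_in_code_page("日本語")   # inside the Japanese code page
thai = render_in_code_page("ภาษาไทย")      # outside the Japanese code page

print(japanese == "日本語")  # the Japanese text survives unchanged
print(thai)                  # every Thai character is replaced
```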

If an interactive Unicode session is started without the -m option, then I/O may not work as expected, because the default “C” locale is used. The “C” locale only consists of single-byte ASCII characters. Workstation BTEQ will read in stdin as an ASCII stream, severely limiting the input character range. Stdout/stderr will be written to as a UTF-8 or UTF-16 stream, but characters may be displayed incorrectly due to the default locale setting.

The -m option is available starting with BTEQ 13.10.0.0.

Defining the system locale on Windows for the -m option

• Go to CONTROL PANEL.
• Select REGIONAL AND LANGUAGE OPTIONS.
• Select a value for "Language for non-Unicode programs". You may have to install appropriate language support first if the language you require is not available. Here is what the screen looks like on Windows XP:

Windows Language Settings

• A reboot may be necessary.

Defining the locale on Unix for the -m option

• Make sure the machine has appropriate language support software installed. Multiple packages may be required.
• Use "locale -a" to view the available locales.
• Set the LC_CTYPE environment variable to an available locale.

Command Line Option Considerations

When a Unicode input file is redirected through stdin for batch mode, it is highly recommended that the I/O encoding be set up front with the -c option, -e option, or clispb.dat file. Otherwise, Workstation BTEQ will try to determine whether the input file is Unicode and will set the session character set value to what it thinks is correct, as follows.

  • BTEQ will check the input file for a leading UTF-16 or UTF-8 BOM and will automatically change the session character set to the appropriate Unicode encoding. 
  • If a BOM-less UTF-16 input file is used, BTEQ will validate that the first two bytes actually make up a valid UTF-16 character before automatically changing the session character set to ‘UTF16’. Note that BTEQ will not automatically change the session character set for a BOM-less UTF-8 input file, and will assume ASCII instead. 
  • If none of the above apply, BTEQ will assume the input file is not Unicode and will default the session character set to ASCII.
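
The order of these checks can be sketched in Python. This is an illustration of the decision sequence only, not BTEQ's code. In particular, the article does not say how BTEQ decides that two bytes form a valid UTF-16 character, so looks_like_utf16_char below is a loudly labeled assumption: it accepts only a printable ASCII-range character read in the host byte order. All names are hypothetical.

```python
import codecs
import sys


def looks_like_utf16_char(first_two: bytes) -> bool:
    """ASSUMPTION: stand-in validity test, not BTEQ's documented check.

    The two bytes, read in the host byte order, must form a printable
    ASCII-range character (U+0020..U+007E).
    """
    if len(first_two) < 2:
        return False
    code_unit = int.from_bytes(first_two, sys.byteorder)
    return 0x20 <= code_unit <= 0x7E


def guess_session_charset(head: bytes) -> str:
    """Follow the documented order: BOM first, BOM-less UTF-16, else ASCII."""
    if head.startswith(codecs.BOM_UTF8):
        return "UTF8"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF16"
    if looks_like_utf16_char(head[:2]):
        return "UTF16"
    return "ASCII"  # this branch also covers BOM-less UTF-8 input


print(guess_session_charset(codecs.BOM_UTF8 + b".LOGON"))
print(guess_session_charset(".LOGON".encode("utf-8")))
```

The last branch is the reason a BOM-less UTF-8 script should always be paired with an explicit -c or -e option.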

Once a UTF-16 session has been started, Workstation BTEQ does not allow the I/O encoding or session character set encoding to be changed with the SESSION CHARSET command. This prevents files from getting corrupted with a mixture of incompatible encodings.

When any of these options and BTEQ commands are specified together on the command line, the command line options must be listed first.
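
The option rules described so far can be collected into a small Python helper that assembles a BTEQ command line. This is a minimal sketch under stated assumptions: the helper and its parameter names are hypothetical, it only builds an argument list and does not run BTEQ, and the relative order of -e, -m, and -c among themselves is an arbitrary choice, since the examples in this article use more than one order.

```python
from typing import List, Optional, Sequence

UNICODE_CHARSETS = {"UTF8", "UTF16"}


def build_bteq_argv(
    c: Optional[str] = None,               # session character set for -c
    e: Optional[str] = None,               # I/O encoding for -e
    m: bool = False,                       # True to add -m
    commands: Sequence[str] = (),          # BTEQ command tokens, e.g. [".SHOW", "VERSIONS"]
    clispb_charset: Optional[str] = None,  # charset_id already set in clispb.dat
) -> List[str]:
    """Build an argument list with encoding options placed before commands."""
    session = (c or clispb_charset or "").upper()
    if (e or m) and session not in UNICODE_CHARSETS:
        raise ValueError("-e and -m need a UTF8 or UTF16 session character set")
    if e and e.upper() not in UNICODE_CHARSETS:
        raise ValueError("-e accepts only UTF8 or UTF16")

    argv = ["bteq"]
    if e:
        argv += ["-e", e]
    if m:
        argv.append("-m")
    if c:
        argv += ["-c", c]
    argv += list(commands)  # BTEQ commands always follow the options
    return argv


print(build_bteq_argv(c="utf8", m=True, commands=[".SHOW", "VERSIONS"]))
```

For batch use, a list like this could be passed to subprocess.run with stdin and stdout attached to the input and output files.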

Batch Invocation Examples

bteq -c utf16 < input.txt > output.txt 2>&1

This is an example of a UTF-16 batch session. Both the I/O encoding and the session character set encoding are UTF-16, which means all communication and all files are UTF-16 encoded. BTEQ is driven by the UTF-16 input script, input.txt, which can optionally contain a UTF-16 BOM. Stdout and stderr are stored in output.txt, a UTF-16 file which does not contain a BOM. Both input.txt and output.txt can be viewed using an appropriate Unicode editor (like Windows Notepad).

bteq -e utf8 -c utf16 < input.txt > output.txt 2>&1

This is an example of a batch session, where the I/O encoding is UTF-8, but the session character set is UTF-16. This means that input.txt and output.txt are both UTF-8 encoded, while the communication with the Teradata Database is in UTF-16.

bteq < input.txt > output.txt 2>&1

This is another batch session, but let's assume that input.txt is a UTF-8 file containing a BOM. In this case, BTEQ would automatically set the I/O encoding and the session character set encoding to UTF-8. This means that output.txt and the data sent to/from the Teradata Database will also be UTF-8 encoded.

Interactive Invocation Examples

bteq -m -c utf16

This is an example of how to start an interactive Unicode session. BTEQ assumes that the system locale is already set and that all I/O will be encoded in multi-byte characters based on that locale. UTF-16 encoded RUN files are not supported. The communication with the Teradata Database and all data files will be UTF-16 encoded. The -m option allows non-Latin characters to be displayed correctly, as shown below.

Proper interactive Unicode session

bteq -c utf8 -m .SHOW VERSIONS

This is similar to the previous example, except that the communication with the Teradata Database and all data files will be UTF-8 encoded. Additionally, the first BTEQ command to be executed will be SHOW VERSIONS. Notice that when encoding control options and BTEQ commands are specified together on the command line, all encoding control options must be listed first.

bteq -c utf16

This is an example of how NOT to start an interactive Unicode session. Since the -m option is not used, Workstation BTEQ uses the default locale ("C"). UTF-16 encoded RUN files are supported, but all interactive input is limited to single-byte ASCII characters. Output is written using double-byte UTF-16 characters, which may be displayed with extra spacing or garbage, depending upon the operating system being used, as seen below in the third column.

Interactive Unicode without the -m option
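
One plausible reason for the extra spacing is visible in the bytes themselves. The Python sketch below is an illustration only, not BTEQ output. It encodes an ASCII-range string as little-endian UTF-16 and shows that every character is followed by a zero byte, which a console expecting single-byte characters may render as a blank or as garbage.

```python
text = "Name"
utf16_bytes = text.encode("utf-16-le")  # little-endian chosen for the illustration

print(utf16_bytes)                  # each letter is followed by a zero byte
print(len(text), len(utf16_bytes))  # twice as many bytes as characters
```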

Determining BTEQ’s Encoding Values

To see exactly what I/O encoding and session character set encoding BTEQ is using, execute the SHOW CONTROLS CHARSET command. It will display the I/O encoding value along with the session charset value. 

SHOW CONTROLS CHARSET example

Using Fonts to View Output

When output is saved from a Unicode batch session, use an appropriate Unicode editor to view the output file. Be aware that the font controls which glyphs are displayed within your editor. Most editors provide an array of fonts; however, only some display non-Latin glyphs correctly. Experiment to see which fonts serve your purpose best. Start by trying some of these fonts, which are available through many Microsoft Windows applications:

  • Arial Unicode MS
  • Courier New
  • Lucida Console
  • Microsoft Sans Serif
  • Tahoma

Here is an example of a UTF-16 file displayed using the Terminal font within Windows Notepad. Notice many of the non-Latin glyphs cannot be displayed. 

Incorrect Unicode display

Here is the same file displayed using the Courier New font. The non-Latin glyphs are displayed correctly.

Correct Unicode display

The REPORTALIGN Command

In addition to using different fonts, a further way to control how UTF-8 output is displayed is to use the REPORTALIGN command. This command allows UTF-8 characters to be printed in one of 3 different ways. The following are the 3 arguments for REPORTALIGN which provide this flexibility.

  • COMPATIBLE - Each byte consumes a print position. For example, a single 3-byte UTF-8 character will physically take up 3 print positions. This is the default.
  • EQUALWIDTH - Each character consumes just one print position, no matter how many bytes make up the character.
  • COLUMNS - Wide and Fullwidth characters consume two print positions, while all other characters take up just one print position. The Unicode Character Database is used to determine which characters are Wide and Fullwidth.

This command is available starting with BTEQ 13.0.0.0. Below is an example showing how the three different options affect the output alignment. Notice the spacing in between columns for each REPORTALIGN option, and that COMPATIBLE allows more data truncation than the other two options.

REPORTALIGN examples
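
The three counting rules can be approximated in a few lines of Python. This is a sketch of the rules as described above, not of BTEQ's implementation. The function name is hypothetical, and Python's unicodedata module stands in for the Unicode Character Database lookup of Wide ("W") and Fullwidth ("F") characters.

```python
import unicodedata


def print_positions(text: str, mode: str = "COMPATIBLE") -> int:
    """Count the print positions a string would occupy under each rule."""
    mode = mode.upper()
    if mode == "COMPATIBLE":
        # One print position per UTF-8 byte.
        return len(text.encode("utf-8"))
    if mode == "EQUALWIDTH":
        # One print position per character (code point), whatever its size.
        return len(text)
    if mode == "COLUMNS":
        # Wide and Fullwidth characters take two positions, all others one.
        return sum(
            2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
            for ch in text
        )
    raise ValueError("mode must be COMPATIBLE, EQUALWIDTH, or COLUMNS")


for mode in ("COMPATIBLE", "EQUALWIDTH", "COLUMNS"):
    print(mode, print_positions("日本", mode))
```

For the two-character string in the loop, the counts are 6, 2, and 4 respectively, which shows why COMPATIBLE leaves less room for data in a fixed-width column.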

Refer to Chapter 5 in the Basic Teradata Query Reference Manual for more information on the REPORTALIGN command.

Windows BTEQ Shortcut Customization

If you prefer to start Windows BTEQ from the Start...Programs...Teradata Client shortcut, there is a way to customize it so that you always start BTEQ with a Unicode session. Simply right click on the Start...Programs...Teradata Client...Teradata BTEQ icon, and then left click on Properties. In the Target box, append the command line options of your choice (outside of any quotes). Click the Apply button, then OK.

BTEQWIN

If you require an interactive Unicode session on Windows, consider using BTEQWIN, which is a true Unicode Windows application, instead of the BTEQ command line interface. When you start BTEQWIN, it first displays the "Select Session Charset Dialog", which gives you a choice of four session character set values, including “UTF8” and “UTF16”. Once a choice has been made, a BTEQ session is opened up within a BTEQWIN window. It executes as if you started BTEQ with the -c option. Both the I/O encoding and the session character set encoding are based on your choice from the "Select Session Charset Dialog". Since BTEQWIN is a Unicode application, your I/O is not limited to the current system locale setting. BTEQWIN can read and write all Unicode characters within the BMP.

BTEQWIN supports a variety of fonts, many of which do not display all Unicode characters properly. Remember that the font plays an important role in how the output is displayed. Adverse effects may occur if you choose a font which is not Unicode capable. It is best to try out different fonts to see which one performs best for your needs.

Valid Code Points

When executing Workstation BTEQ with a Unicode script, BTEQ commands are still made up of Latin characters within the range U+0020 through U+007E. Although Unicode provides alternatives for some of these basic characters (such as U+FF10 – Fullwidth Digit Zero or U+FF21 – Fullwidth Latin Capital Letter A), BTEQ does not recognize these as valid code points for BTEQ commands.
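
A script can be screened for such look-alike characters before it is run. The Python sketch below is a hypothetical pre-check, not a BTEQ feature. It assumes the string passed in is the text of a BTEQ command and reports every character outside the U+0020 through U+007E range.

```python
from typing import List, Tuple


def invalid_command_chars(command: str) -> List[Tuple[int, str]]:
    """Return (index, code point label) for characters outside U+0020..U+007E."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(command)
        if not 0x20 <= ord(ch) <= 0x7E
    ]


print(invalid_command_chars(".SHOW VERSIONS"))       # plain Latin command
print(invalid_command_chars(".SET WIDTH 8\uff10"))   # ends with Fullwidth Digit Zero
```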

Conclusion

Workstation BTEQ provides flexibility, allowing users various ways in which a Unicode session can be executed. It is up to you to determine which way is the most productive for your environment.