

Teradata Parallel Transporter Supports Quoted VARTEXT in TTU14.00
We have made great strides in improving our handling of delimited data (i.e. CSV data) in Teradata Parallel Transporter for the TTU14.00 release. This article will describe the background of the data format, the original support, and the enhancements we have made.
Background and Theory
Comma-Separated Values
Comma-separated values (CSV) refers to a platform-independent data transport format, consisting of data values (expressed as character sequences), separated by commas. Of course, using comma as the value separator causes problems if the data values themselves include commas (for example, a single value containing both city and state, or a last-comma-first full name).
Delimiter-Separated Values
To avoid the problems of commas in data values, a delimiter-separated values (DSV) format was introduced. The DSV format allows the use of any separator, so that conflicts between data values and the delimiter can generally be avoided.
Although the delimiter is typically a single character, there is no technical reason that it cannot be a multi-character sequence.
Enclosing Data Values
Because there are cases where the nature of the data precludes certain knowledge as to what characters may or may not occur in the data values, quoted DSVs were introduced. Using quoted DSVs, the delimiter can occur within the data. The only thing that cannot occur within the data is the character used to enclose the data values.
General Rules
For the most generality, an application should accept the widest variety of DSV input, especially since there is no single widely accepted standard for CSV/DSV.
On the other hand, applications that emit CSV/DSV should be careful to limit their output so that it is acceptable to less forgiving receiving applications.
Teradata Implementation
Teradata’s Original Implementation
Teradata’s original implementation of CSV/DSV, referred to variously as VARTEXT or Delimited Data, had the following restrictions:
- The delimiter was limited to one single-byte character (default: |, the vertical bar/pipe).
- There was no support for quoted data values.
- Empty values (indicated, in the case of the first field, by a delimiter at the start of the line; in the case of all fields other than the first and last fields, by adjacent delimiters; or, in the case of the last field, by a delimiter immediately followed by end-of-line) were passed to the DBS as NULL.
The result of parsing each input line was a series of VARCHAR() fields, each holding the value of the corresponding input data value.
MBCS Support
Subsequently, support for the delimiter character was expanded to allow a single multi-byte character.
TPT Enhancements For TTU 14.00
The biggest and most important aspect of the delimited data enhancement in TPT14.00 is the support of quoted delimited data. This is very important because more and more customers are moving data from non-Teradata databases into Teradata. And the export tools being used to extract data from those non-Teradata databases often write out the data to flat files in delimited format, where one or more fields are enclosed in quotes.
New DataConnector Operator Attribute
In order to enable this new feature, we have introduced a new attribute to the TPT DataConnector operator:
QuotedData
-
No (default, quoted data is not supported)
- current rules apply
- quotes are considered a part of the data
-
Yes (all fields must be quoted)
- quotes are not considered to be part of the data
- Optional (fields may be a mixture of quoted and unquoted)
Enclosing Data Values
Although the term “quoted” may seem to imply that either single quotes (apostrophes) or double quotes (quotation marks) are used to enclose the values, which is not the case:
- The enclosing characters need not be quotes.
- The enclosing characters need not be single characters (that is, a multi-character sequence can be used)
- The open quote and close quote need not be the same; they can be distinct.
Note: for purposes of this article, “close quote” includes both a distinct close quote, and a common open and close quote.
Rules For Quotes
The following rules apply to both the open and close quote:
- If the open quote and close quote are distinct from each other, neither can be a substring of the other.
- Neither open quote nor close quote can be a substring of the delimiter.
- The delimiter cannot be a substring of either open quote or close quote.
- The backslash character (\) cannot occur in either the open quote or close quote.
Rules For Parsing The Input Line
Parsing the input line is relatively straightforward:
- If values are unquoted, scan for the delimiter or end-of-line, since those are the only significant characters.
-
If values are always quoted, the following rules apply:
- At the start of the input line, or after a delimiter, an open quote must be present (otherwise, it’s a malformed input line).
-
Following an open quote, all characters become part of the data value, with the following exceptions:
- A doubled close quote causes one close quote to become part of the data value.
- A backslash-escaped close quote causes the escape backslash to be discarded and the close quote to become part of the data value.
- A backslash-escaped backslash causes the escape backslash to be discarded and the second backslash to become part of the data value.
- An undoubled, unescaped close quote terminates the data value, and must be immediately followed by a delimiter or end-of-line.
-
If values are optionally quoted, the following rules apply:
- At the start of the input line, or after a delimiter, if an open quote is present, the value is quoted and the rules above for always-quoted values apply.
- Otherwise, the value is unquoted, and the rules above for unquoted values apply.
Examples of Quoted Data
Some typical five-field input lines with quoted values and using the default open/close quote might look like:
"abc"|"def"|"g|i"|"jkl"|"mno"|
"123"|"456"|"|||"|"pqr"|"xyz"
A typical five-field input line with quoted values using “sexed” quotes (“ and ”, for open and close quote, respectively) might look like:
“Smith”,“Jane”,“Dec 25, 1980”,“F”,“PhD”
A typical five-field input line with quoted values using distinct open (<#) and close (#>) quotes and comma as the delimiter might look like:
<#Smith#>,<#Jane#>,<#Dec 25, 1980#>,<#F#>,<#PhD#>
The previous examples, changed to only quote values when necessary (i.e., only when the value contains the delimiter) might look like:
abc|def|"g|i"|jkl|mno|
123|456|"|||"|pqr|xyz
Smith,Jane,“Dec 25, 1980”,F,PhD
Smith,Jane,<#Dec 25, 1980#>,F,PhD
Empty values
An unquoted value is empty:
- When a delimiter occurs at the start of the line, then the first field is empty.
- When two delimiters are adjacent to each other, then the field corresponding to that relative position is empty.
- When the delimiter following the last value is omitted, and there are no characters between the preceding delimiter and end-of-line, then the last field is empty.
In each of these input lines, the first, third, and fifth fields are empty:
|abc||xyz||
|123||456|
A quoted value is empty:
- When the open quote is immediately followed by the close quote.
In this input line, the second and fourth fields are empty:
"abc"|""|"ghi"|""|"mno"
The user can specify how empty data values are to be handled:
- They can be assigned NULL (the default, for backward compatibility).
- They can be set to zero-length VARCHARs.
Escapes
When dealing with quoted values, it may be necessary to include the close quote as part of the data value. Two escape mechanisms are provided for this:
-
Doubling:
- If the open quote and close quote are the same, and a quote is needed as part of the data value, the quote is repeated (doubled) at the location where a quote is needed in the value. A single occurrence of the quote is included in the data value for each doubled quote.
- If the open quote and close quote are distinct, and a close quote is needed as part of the data value, the close quote is repeated (doubled) at the location where a close quote is needed in the value. A single occurrence of the close quote is included in the data value for each doubled close quote. Note that no doubling is necessary to include the distinct open quote in the value.
-
Backslash escape:
- If the open quote and close quote are the same, and a quote is needed as part of the data value, the quote is preceded by backslash (\) at the location where a quote is needed in the value. Only the quote becomes part of the value; the backslash escape is discarded.
- If the open quote and close quote are distinct, and a close quote is needed as part of the data value, the close quote is preceded by backslash (\) at the location where a close quote is needed in the value. Only the close quote becomes part of the value; the backslash escape is discarded. Note that no backslash escape is necessary to include the distinct open quote in the value.
- If a backslash is needed as part of the data value, it is doubled. That is, \\ as part of the quoted input data value becomes a single backslash in the resultant value.
Neither escape mechanism has any meaning when a data value is unquoted. For quoted data values, the user may use either or both escape mechanisms (note that, regardless of the escape mechanism used, backslashes must be doubled).
Some examples:
"ab\"c"|"\"def"|"ghi\""|
results in three fields with values ab"c, "def, and ghi".
"ab""c"|"""def"|"ghi"""|
results in three fields with values ab"c, "def, and ghi".
Hi, we recently get the tpt14.0 installed and testing out the features of quoted delimited file. However, I am having trouble to deal with "escape" feature by "doubling".
The data looks like "good ""job"|20333
the tpt always throw error
Delimiter did not immediately follow close quote mark in row xxxxxxx, col x
I am trying to figure out what flags needs to be changed in the data connector . So far, no success. Can you please elaborate how can I enable the "doubling" escape mechanism for data connector?
Here is the data connector's definition
DEFINE OPERATOR FILE_READER
TYPE DATACONNECTOR PRODUCER
SCHEMA W_0_s_OTHER_PAYER_1
ATTRIBUTES
(
VARCHAR FileName = 'db2_export.del',
VARCHAR Format = 'DELIMITED',
VARCHAR QuotedData = 'Optional',
VARCHAR TextDelimiter = '|',
VARCHAR OpenQuoteMark = '"',
VARCHAR CloseQuoteMark = '"',
VARCHAR OpenMode = 'Read',
VARCHAR TrimColumns = 'Both',
VARCHAR NullColumns = 'N',
VARCHAR RowErrFileName = 'db2export.del.err'
);