Extensibility
BYOM Extended Universe
Version: 3.0.1 - Created: 16 Sep 2025
This package contains a collection of Teradata functions complementary to Teradata BYOM (Bring Your Own Model). These functions do not replace BYOM; instead, they help you work more efficiently with Small Language Models directly in the database.

🚀 Functions Included

1. ArgMax
A table operator that extracts the index and value of the largest element from a vector embedded in table columns.
Use case: Get the predicted class and confidence score from classification models that output a probability vector.
Inputs:
• Table with vector columns named like emb_0, emb_1, ..., all of type FLOAT.
Outputs:
• All input columns, plus:
• arg_max_index: Index of the highest value in the vector.
• arg_max_value: The corresponding value.
Parameters:
• VectorColumnsPrefix (STRING): Prefix for vector columns (e.g., 'emb_').
• VectorColumnsNumber (INTEGER): Number of vector columns.
Example:
SELECT * FROM byom_extended_universe.ArgMax(
    ON sasha.complaints_sentiment
    USING VectorColumnsPrefix('emb_'), VectorColumnsNumber(2)
) AS a;

2. SoftMax
Transforms a vector of raw scores into a probability distribution (values sum to 1). Useful for making classification outputs more interpretable.
Inputs:
• Table with raw prediction vector columns.
Outputs:
• All original columns, with vector columns replaced by their SoftMax-transformed equivalents.
Parameters:
• VectorColumnsPrefix (STRING): Prefix for vector columns.
• VectorColumnsNumber (INTEGER): Number of vector columns.
Example:
SELECT * FROM byom_extended_universe.SoftMax(
    ON sasha.complaints_sentiment
    USING VectorColumnsPrefix('emb_'), VectorColumnsNumber(2)
) AS a;
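The two operators compose naturally: apply SoftMax first to turn raw scores into probabilities, then ArgMax to read off the predicted class and its probability. The sketch below is illustrative only; it assumes the sample table sasha.complaints_sentiment from the examples above, that SoftMax keeps the emb_ column names in its output, and that an id column exists in the table.
SELECT id, arg_max_index AS predicted_class, arg_max_value AS probability
FROM byom_extended_universe.ArgMax(
    ON (SELECT * FROM byom_extended_universe.SoftMax(
            ON sasha.complaints_sentiment
            USING VectorColumnsPrefix('emb_'), VectorColumnsNumber(2)
        ) AS s)
    USING VectorColumnsPrefix('emb_'), VectorColumnsNumber(2)
) AS a;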
3. LengthInTokens
A table operator that calculates the length in tokens of a text field. Especially handy for LLM/SLM input preparation.
Inputs:
• Data table with a txt column (text to process).
• Tokenizer table: one row, one column named tokenizer of type BLOB, passed with the DIMENSION keyword. The BLOB should contain the contents of the tokenizer.json file from the desired model. This is the same format as in BYOM's ONNXEmbeddings.
Outputs:
• All input columns.
• Additional column length_in_tokens (INTEGER): number of tokens generated by the provided tokenizer on the txt field.
Parameters: (none)
Example:
SELECT * FROM byom_extended_universe.LengthInTokens(
    ON (SELECT id, txt FROM complaints.complaints_clean)
    ON (SELECT tokenizer FROM embeddings_tokenizers WHERE model_id = 'bge-small-en-v1.5') DIMENSION
) a;

4. ChunkText
Text chunking is critical for working with language models: breaking large texts into model-friendly pieces. ChunkText is a tokenizer-aware chunker: it splits text so that no chunk exceeds a specified token limit, using the tokenizer you provide.
Inputs:
• Data table with a txt column (VARCHAR or CLOB; text to chunk). Use the CLOB type to process bigger texts.
• Tokenizer table: one row, one column named tokenizer of type BLOB, passed with the DIMENSION keyword. The BLOB should contain the contents of the tokenizer.json file from the desired model, as used in BYOM's ONNXEmbeddings.
Outputs:
• All input columns except txt.
• chunk_number (INTEGER): chunk index, starting at 0.
• txt (VARCHAR): text of the chunk, always Unicode.
• chunk_length_in_tokens (INTEGER): length of the chunk in tokens.
Parameters:
• MaxTokens (INTEGER, required): Maximum tokens per chunk (must be > 2).
• MaxOverlapWords (INTEGER, default 0): Words to carry over from the previous chunk (semantic overlap; 0 = no overlap).
• FirstNChunks (INTEGER, default 0): Output only the first N chunks (0 = all chunks).
• OutputVarcharLength (INTEGER, default 6000): Length of the output txt VARCHAR (1–32000 allowed).
• SplittingStrategy (STRING, default 'WORDS'): How to split the text (see below).
Splitting strategies (parameter SplittingStrategy):
• WORDS (default): Splits the text by words. Chunks are created by grouping words so the total token count stays under MaxTokens.
• SENTENCES: Splits by sentences using language-appropriate sentence boundaries. Chunks consist of full sentences, grouped as long as the token limit allows.
• PARAGRAPHS: Splits by paragraphs, each chunk aiming to be one or more full paragraphs under the token limit.
Fallback logic: If a text unit (paragraph, sentence, or word) exceeds the MaxTokens limit, the function automatically falls back to a finer splitting strategy (e.g., PARAGRAPH → SENTENCE → WORD) for that chunk. If a single word is still too long, it is placed in its own chunk, even if it exceeds MaxTokens. Chunks may overlap by up to MaxOverlapWords to preserve semantic context (especially useful for RAG or embeddings). Internally, splitting uses regular expression rules for each unit type.
Example:
SELECT * FROM byom_extended_universe.ChunkText(
    ON (SELECT id, txt FROM complaints.complaints_clean)
    ON (SELECT tokenizer FROM embeddings_tokenizers WHERE model_id = 'bge-small-en-v1.5') DIMENSION
    USING MaxTokens(25) MaxOverlapWords(3) FirstNChunks(1) SplittingStrategy('WORDS') OutputVarcharLength(32000)
) a;
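Because LengthInTokens and ChunkText share the same tokenizer input, a common pattern is to measure token counts first and chunk only the texts that exceed a model's context limit. The following sketch is illustrative only; it reuses the sample tables from the examples above, and the 512-token threshold is an assumed limit for the target model.
SELECT id, length_in_tokens
FROM byom_extended_universe.LengthInTokens(
    ON (SELECT id, txt FROM complaints.complaints_clean)
    ON (SELECT tokenizer FROM embeddings_tokenizers WHERE model_id = 'bge-small-en-v1.5') DIMENSION
) AS t
WHERE length_in_tokens > 512  -- assumed context limit; texts above it are candidates for ChunkText
ORDER BY length_in_tokens DESC;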
5. NerReplace
NerReplace is a production-ready tool for finding and replacing sensitive information (PII) directly inside Teradata:
• Remove PII before sharing data: replace names, addresses, account numbers, and more without moving data outside the database.
• Stay compliant and secure: all processing happens in-database, minimizing the risk of exposing private information.
• Seamless integration: makes it safe to use third-party analytics or machine learning tools without compromising privacy.
• Customizable, auditable, and fast: NerReplace brings privacy-first data workflows directly to your Teradata environment.
This is a table operator. It is designed to operate on the output of the ONNXEmbeddings function.
Inputs:
• Table with input data. Required columns: txt, the input text (VARCHAR or CLOB), and logits, the output of the model executed with ONNXEmbeddings (BLOB or VARBYTE). Any other columns are allowed and passed through.
• One-row, one-column table with the tokenizer: a single record with one column named tokenizer of type BLOB, containing the contents of tokenizer.json from Hugging Face for the model used in ONNXEmbeddings (the same as the third table input of ONNXEmbeddings). Passed with the DIMENSION keyword.
• One-row, one-column table with the model config: a single record with one column named config of type BLOB, containing the contents of config.json from Hugging Face for the model used in ONNXEmbeddings. Passed with the DIMENSION keyword.
Outputs:
• All the columns from the input table. In column txt, the original text is replaced with the processed text.
• Column logits is copied only if the KeepLogits parameter is 'true'.
• Column replaced_entities (VARCHAR) with entity details. Only appears if OutputDetails is 'true'.
Parameters:
• OutputDetails (optional; 'true'/'false'; default false): If 'true', adds the replaced_entities column with details (JSON) about each replaced entity: begin, end, score, text, and label. Example: {"begin":109,"end":130,"text":"Quantum Analytics LLC","entity":"COMPANYNAME","score":0.9395}
• EntitiesToReplace (optional; list of entity labels, e.g. 'EMAIL', 'SSN', …; default: all): Restricts replacements to these entity labels. With NONE aggregation, BIO prefixes are present (e.g. B-EMAIL). Raises an error if a label does not exist for the model.
• AggregationStrategy (optional; 'NONE', 'SIMPLE', 'AVERAGE', 'FIRST', 'MAX'; default SIMPLE): How tokens are grouped into entities (see the aggregation strategies below).
• ReplaceWithEntityName (optional; 'true'/'false'; default true): If 'true', replaces entities with their label (e.g. 'EMAIL'). If 'false', uses the ReplacementText parameter.
• ReplacementText (optional; any string; default none): Text to replace entities with (e.g. '[REDACTED]'). Requires ReplaceWithEntityName('false').
• KeepLogits (optional; 'true'/'false'; default false): If 'true', copies the logits column to the output.
• ReplacedEntiyInfoLength (optional; integer 1–32000; default 6000): Length limit (characters) for the replaced_entities column (JSON). Only applies if OutputDetails is 'true'.
Aggregation strategies (parameter AggregationStrategy):
• NONE: No grouping; each token is an entity.
• SIMPLE: Groups tokens into words using the tokenizer's word-ids and assigns the entity by maximum score.
• FIRST: Entity label taken from the first token of each word (requires a word-aware tokenizer).
• AVERAGE: Entity label obtained by averaging the logits of all tokens in a word, then applying softmax (requires a word-aware tokenizer).
• MAX: Entity label obtained from the maximum logit over all tokens in a word (requires a word-aware tokenizer).
Example:
SELECT *
FROM byom_extended_universe.NerReplace(
    ON (SELECT id, txt AS orig_txt, txt, logits FROM sasha.ner_input_distilbert_finetuned_ai4privacy_v2)
    ON (SELECT model AS tokenizer FROM sasha.ner_tokenizers WHERE model_id = 'distilbert_finetuned_ai4privacy_v2') DIMENSION
    ON (SELECT model AS config FROM sasha.ner_model_configurations WHERE model_id = 'distilbert_finetuned_ai4privacy_v2') DIMENSION
    USING AggregationStrategy('AVERAGE') OutputDetails('True') EntitiesToReplace('FIRSTNAME', 'LASTNAME', 'SSN', 'DOB')
) a;

🛠️ Installation
Prerequisite: BYOM v6.0 or newer must be installed on your Teradata system.
Installation steps:
1. Create the database and grant permissions:
CREATE DATABASE byom_extended_universe AS PERM = <50000000 * NUMBER OF AMPs IN A SYSTEM>;
GRANT CREATE EXTERNAL PROCEDURE ON byom_extended_universe TO dbc;
GRANT CREATE FUNCTION ON byom_extended_universe TO dbc;
2. Switch to the database and install the JAR file:
DATABASE byom_extended_universe;
CALL SQLJ.INSTALL_JAR('cj!<PATH TO JAR FILE>', 'BYOM_EU', 0);
3. Create the functions:
REPLACE FUNCTION byom_extended_universe.ArgMax()
RETURNS TABLE VARYING USING FUNCTION ArgMax_contract
LANGUAGE JAVA NO SQL PARAMETER STYLE SQLTable
EXTERNAL NAME 'BYOM_EU:com.teradata.byom.extended.universe.vector.ops.ArgMax.execute()';

REPLACE FUNCTION byom_extended_universe.SoftMax()
RETURNS TABLE VARYING USING FUNCTION SoftMax_contract
LANGUAGE JAVA NO SQL PARAMETER STYLE SQLTable
EXTERNAL NAME 'BYOM_EU:com.teradata.byom.extended.universe.vector.ops.SoftMax.execute()';

REPLACE FUNCTION byom_extended_universe.LengthInTokens()
RETURNS TABLE VARYING USING FUNCTION LengthInTokens_contract
LANGUAGE JAVA NO SQL PARAMETER STYLE SQLTable
EXTERNAL NAME 'BYOM_EU:com.teradata.byom.extended.universe.nlp.utils.LengthInTokens.execute()';

REPLACE FUNCTION byom_extended_universe.ChunkText()
RETURNS TABLE VARYING USING FUNCTION ChunkerTO_contract
LANGUAGE JAVA NO SQL PARAMETER STYLE SQLTable
EXTERNAL NAME 'BYOM_EU:com.teradata.byom.extended.universe.chunking.ChunkerTO.execute()';

REPLACE FUNCTION byom_extended_universe.NerReplace()
RETURNS TABLE VARYING USING FUNCTION ReplaceNerTO_contract
LANGUAGE JAVA NO SQL PARAMETER STYLE SQLTable
EXTERNAL NAME 'BYOM_EU:com.teradata.byom.extended.universe.ner.ReplaceNerTO.execute()';
4. Grant execution rights:
GRANT EXECUTE FUNCTION ON byom_extended_universe TO <DESIRED USER/ROLE>;
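A variation on the example above: instead of writing entity labels into the text, ReplaceWithEntityName('false') together with ReplacementText masks every detected entity with a fixed string. This sketch is illustrative only and reuses the sample tables from the example; the '[REDACTED]' marker is an arbitrary choice.
SELECT *
FROM byom_extended_universe.NerReplace(
    ON (SELECT id, txt, logits FROM sasha.ner_input_distilbert_finetuned_ai4privacy_v2)
    ON (SELECT model AS tokenizer FROM sasha.ner_tokenizers WHERE model_id = 'distilbert_finetuned_ai4privacy_v2') DIMENSION
    ON (SELECT model AS config FROM sasha.ner_model_configurations WHERE model_id = 'distilbert_finetuned_ai4privacy_v2') DIMENSION
    USING AggregationStrategy('SIMPLE') ReplaceWithEntityName('false') ReplacementText('[REDACTED]')
) a;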
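After installation, a quick way to confirm that the functions are callable is to run one of them on a one-row inline query, with no staged tables needed. This is a hedged sketch: it assumes the calling user has been granted EXECUTE FUNCTION and that a single-row derived table is acceptable input for ArgMax; the returned arg_max_value should be 0.8.
SELECT * FROM byom_extended_universe.ArgMax(
    ON (SELECT 1 AS id, CAST(0.2 AS FLOAT) AS emb_0, CAST(0.8 AS FLOAT) AS emb_1)
    USING VectorColumnsPrefix('emb_'), VectorColumnsNumber(2)
) AS a;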
Teradata 14.10 XML Data Type
Version: 13.00.00.00 - Created: 15 Jul 2015
The XML data type, introduced in Teradata 14.10, provides the following new capabilities:
• A new XML data type that allows you to store XML content in a compact binary form that preserves the information set of the XML document, including the hierarchy information and type information derived from XML validation
• Functions and methods on the XML type that support common XML operations like parsing, validation, transformation (XSLT), and query (XPath and XQuery)
• Support for the XQuery query language for querying and transforming XML content
• Support for the following SQL/XML functions and stored procedures: XMLDOCUMENT, XMLELEMENT, XMLFOREST, XMLCONCAT, XMLCOMMENT, XMLPI, XMLTEXT, XMLAGG, XMLPARSE, XMLVALIDATE, XMLQUERY, XMLSERIALIZE, XMLTABLE, AS_SHRED_BATCH, XMLPUBLISH, XMLPUBLISH_STREAM
Benefits
• Removes the requirement to map between hierarchical and relational models prior to storing the XML contents. Previously, this was required when using the XML shredding and publishing functionality in Teradata XML Services.
• Provides the ability to preserve document identity. In contrast, the shredding facility that Teradata XML Services supports only extracts the values out of the XML document without retaining the document identity.
• Provides a compact method for storing XML content where the internal representation is 5 to 10 times smaller than the original text representation.
• Provides standard query language support for querying XML using XQuery 1.0 and XPath 2.0.
• Provides a more efficient method for parsing, transforming and querying XML content.
• Integrates the functionality in the Teradata XML Services offering into Teradata Database using syntax that conforms to the ANSI SQL/XML specification.
Considerations
• The XML type accommodates values up to 2GB in size. However, operations like XSLT and XQuery, which load documents into memory, are only supported on documents that are smaller in size, where the processing operation does not require more memory than specified by the XML_MemoryLimit DBS Control field. This field caps the amount of memory available for XML operations so that these operations do not strain memory resources. A pseudo-streaming version of XQuery is allowed on large documents via the XMLEXTRACT method on the XML type.
• XML path indexes are not supported.
• XML schemas and XSLT stylesheets are not managed by Teradata Database.
Schema and Stylesheet Consolidation Utility
The Schema and Stylesheet Consolidation utility is available for download as a zip file on this page. There is one significant change in this utility compared to the user documentation: the documentation describes two executables, ConsolidateSchema.exe and ConsolidateStylesheet.exe, but in this version of the utility the two have been combined into a single executable, csldgen.exe. The Readme.txt file included in the zip file gives more details regarding the need for schema and stylesheet consolidation and how to perform it. This utility is only relevant if your schemas and stylesheets use imports and includes.
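To make the capabilities above concrete, here is a small illustrative sketch using the XML type together with three of the SQL/XML functions listed (XMLPARSE, XMLQUERY, XMLSERIALIZE). The table and element names are invented for illustration, and exact clause options can vary by release, so treat this as a sketch and consult the Teradata XML user documentation for authoritative syntax.
CREATE TABLE orders_xml (
    order_id INTEGER,
    order_doc XML        -- stored in the compact binary XML form
);

-- Parse a text document into the XML type
INSERT INTO orders_xml
VALUES (1, XMLPARSE(DOCUMENT '<order><item qty="2">widget</item></order>' PRESERVE WHITESPACE));

-- Query with XQuery/XPath and serialize the result back to text
SELECT order_id,
       XMLSERIALIZE(CONTENT
           XMLQUERY('/order/item' PASSING order_doc RETURNING CONTENT)
           AS VARCHAR(1000)) AS item_xml
FROM orders_xml;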
Teradata XML Services
Version: 13.00.00.00 - Created: 15 Oct 2014
Teradata XML Services provide assistance in database transformation of XML structures to and from relational structures. This is primarily an enterprise-fit feature. XML in this context is regarded as a data format used to describe incoming or outgoing warehouse data. A key concept for this feature is that we are not transforming to store XML, but rather to maintain a relational data model or to integrate relational data into an enterprise XML message structure. The relational data model is best suited for enterprise analytics; XML structures are best suited for enterprise integration.
Teradata XML Services is supported on Database versions 13, 13.10, and 14. As of Teradata Database version 14.10, much of the XML Services functionality has been implemented as part of the XML data type in the database, and Teradata XML Services as a separate download will not be supported for 14.10 and future versions. Mappings created for XML shredding and publishing can be used for 14.10 as well (except for XSLT shredding), but the names of the stored procedures will change. Please see the Teradata XML book in the database user documentation for further details.
This feature is delivered asynchronously from any specific Teradata warehouse release. The delivery format is a web download, available for each Teradata server platform. The feature is considered a part of the Teradata product and is supported through normal support channels.
Teradata XML Services consists of the following components:
• Xerces XML parser and Xalan XSLT transformer, packaged as a platform-specific operating system library.
• Shredding Framework, consisting of a combination of stored procedures and functions. A stored procedure controls the shredding process. When shredding one to a few documents, the stored procedure directly invokes the data maintenance DML. When shredding many documents, the stored procedure uses a set-based approach through the invocation of a table generation function.
• Parallel Publishing Framework, consisting of a combination of stored procedures and functions. A stored procedure controls the publishing process. The stored procedure can either return a string representing an XML object type or a SQL statement that represents the XML data stream. The SQL statement can be reused in views, macros, FastExport, etc.
• General-purpose XSLT transformation function.
• Two XPath search functions: one that returns a scalar character value and one that returns an XML fragment character value.
• XML schema validation function.
• XML schema and stylesheet loading and dependency resolution.
• XML schema generation procedures.
• Perl-based installation process.
For community-based support and to share your implementation ideas and concerns, please visit the Extensibility forum.
Block Level Compression Evaluation Utility
Version: 13.10.00.00 - Created: 13 Feb 2012
Teradata 13.10 features Block Level Compression (BLC), which provides the capability to compress whole data blocks at the file system level before the data blocks are actually written to storage. Like any compression feature, BLC helps save space and reduce I/O. This BLC utility lets Teradata users run against a TD 13.10 system to select a list of BLC candidate tables and evaluate the impact of BLC on space and speed for each table of interest, providing the information needed to choose appropriate tables for BLC. To learn how to use the BLC utility package, please see the article Block Level Compression evaluation with the BLC utility. For community support for this package, please visit the Extensibility Forum.
Algorithmic Compression Test Package
Version: 1.0.0.2 - Created: 12 Nov 2010
The ALC (ALgorithmic Compression) test package contains UDFs simulating TD 13.10 built-in compression functions, test templates for Latin and Unicode character columns, and step-by-step instructions. It is intended for Teradata users to run over specific data at the column level to determine the compression rates of the TD 13.10 built-in compression algorithms. The test results provide information for selecting an appropriate algorithm for specific data. These tests use read-only operations and can be executed on any release that supports UDFs (V2R6.2 and forward). It is recommended to run these tests during off-peak hours, as they will use a significant amount of system resources (CPU bound). Usage: To learn how to install and use the test package, please see Selecting an ALC compression algorithm. For community support for this package, please visit the Extensibility forum.