BYOM Extended Universe

This package contains a collection of Teradata functions complementary to Teradata BYOM (Bring Your Own Model).
These functions do not replace BYOM; instead, they help you work more efficiently with Small Language Models directly in the database.

🚀 Functions Included

1. ArgMax

A table operator that extracts the index and value of the largest element from a vector embedded in table columns.

Use case: Get the predicted class and confidence score from classification models outputting a probability vector.

Inputs:

Table with vector columns named like emb_0, emb_1, ..., all of type FLOAT.

Outputs:

All input columns, plus:
- arg_max_index: Index of the highest value in the vector.
- arg_max_value: The corresponding value.

Parameters:

VectorColumnsPrefix (STRING): Prefix for vector columns (e.g., 'emb_').
VectorColumnsNumber (INTEGER): Number of vector columns.

Example:

SELECT * FROM byom_extended_universe.ArgMax(
    ON sasha.complaints_sentiment
    USING
        VectorColumnsPrefix('emb_')
        VectorColumnsNumber(2)
) AS a;
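
To make the output columns concrete, here is a minimal Python sketch of the computation ArgMax performs on one row. It is an illustration only, not the operator's implementation: the helper name arg_max, the dictionary-based row, the zero-based index (inferred from the emb_0 naming), and the "first maximum wins" tie-breaking are all assumptions.

```python
def arg_max(row, prefix="emb_", number=2):
    """Return (index, value) of the largest vector element in a row.

    `row` is a plain dict standing in for one table row (an assumption);
    the vector lives in columns prefix0, prefix1, ..., prefix{number-1}.
    """
    vector = [row[f"{prefix}{i}"] for i in range(number)]
    # max() returns the first index on ties; the real operator's
    # tie-breaking is not documented here.
    best_index = max(range(number), key=lambda i: vector[i])
    return best_index, vector[best_index]


# Hypothetical row with a two-class probability vector.
row = {"id": 1, "emb_0": 0.12, "emb_1": 0.88}
index, value = arg_max(row)
# All input columns are kept and the two result columns are appended.
print({**row, "arg_max_index": index, "arg_max_value": value})
```

For a classification model, arg_max_index is the predicted class and arg_max_value is its confidence score.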

2. SoftMax

Transforms a vector of raw scores into a probability distribution (values sum to 1). Useful for making classification outputs more interpretable.

Inputs:

Table with raw prediction vector columns.

Outputs:

All original columns, with vector columns replaced by their SoftMax-transformed equivalents.

Parameters:

VectorColumnsPrefix (STRING): Prefix for vector columns.
VectorColumnsNumber (INTEGER): Number of vector columns.

Example:

SELECT * FROM byom_extended_universe.SoftMax(
    ON sasha.complaints_sentiment
    USING
        VectorColumnsPrefix('emb_')
        VectorColumnsNumber(2)
) AS a;
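
For reference, the softmax transformation itself can be sketched in a few lines of Python. This is the standard formula, not the operator's source code; the function name soft_max and the max-subtraction step (a common trick for numerical stability) are choices made for this illustration.

```python
import math


def soft_max(scores):
    """Map raw scores to probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp() for large scores
    # and does not change the result.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for a two-class model (emb_0, emb_1).
probabilities = soft_max([2.0, 0.0])
print(probabilities, sum(probabilities))
```

The relative order of the values is preserved, so applying ArgMax before or after SoftMax selects the same class; only the reported value changes.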

3. LengthInTokens

A table operator that calculates the length in tokens for a text field. Especially handy for LLM/SLM input prep.

Inputs:

Data table with a txt column (text to process).
Tokenizer table: One row, one column "tokenizer" of type BLOB, using the DIMENSION keyword. The BLOB should contain the contents of the tokenizer.json file from the desired model. This is the same format as in BYOM’s ONNXEmbeddings.

Outputs:

All input columns.
Additional column: length_in_tokens (INTEGER) — number of tokens generated using the provided tokenizer on the txt field.

Parameters: (none)

Example:

SELECT * FROM byom_extended_universe.LengthInTokens (
ON (SELECT id, txt FROM complaints.complaints_clean)
ON (SELECT tokenizer FROM embeddings_tokenizers WHERE model_id = 'bge-small-en-v1.5') DIMENSION
) a;

4. ChunkText

Text chunking is critical for working with language models: it breaks large texts into model-friendly pieces. ChunkText is a tokenizer-aware chunker that splits text so that no chunk exceeds a specified token limit, as measured by the tokenizer you provide.

Inputs:

Data table with a txt column (VARCHAR or CLOB; text to chunk). Use CLOB type to process bigger texts.
Tokenizer table: One row, one column "tokenizer" of type BLOB, using the DIMENSION keyword. The BLOB should contain the contents of the tokenizer.json file from the desired model, as used in BYOM’s ONNXEmbeddings.

Outputs:

All input columns except txt.
chunk_number (INTEGER): chunk index, starts at 0.
txt (VARCHAR): text of the chunk, always Unicode.
chunk_length_in_tokens (INTEGER): length of the chunk in tokens.

Parameters:

MaxTokens
- Type: INTEGER
- Default: none (required)
- Description: Maximum tokens per chunk (must be > 2).
MaxOverlapWords
- Type: INTEGER
- Default: 0
- Description: Words to carry over from previous chunk (semantic overlap; 0 = no overlap).
FirstNChunks
- Type: INTEGER
- Default: 0
- Description: Output only the first N chunks (0 = all chunks).
OutputVarcharLength
- Type: INTEGER
- Default: 6000
- Description: Length of output txt VARCHAR (1–32000 allowed).
SplittingStrategy
- Type: STRING
- Default: 'WORDS'
- Description: How to split the text (see below for strategies).

Splitting Strategy (parameter `SplittingStrategy`):

WORDS (default): Splits the text by words. Chunks are created by grouping words so the total token count stays under MaxTokens.
SENTENCES: Splits by sentences using language-appropriate sentence boundaries. Chunks consist of full sentences, grouped as long as the token limit allows.
PARAGRAPHS: Splits by paragraphs, each chunk aiming to be one or more full paragraphs under the token limit.
Fallback logic: If a paragraph or sentence exceeds the MaxTokens limit, the function automatically falls back to a finer splitting strategy (PARAGRAPHS → SENTENCES → WORDS) for that unit.
If a single word is still too long, it is placed in its own chunk, even if it exceeds MaxTokens.
Chunks may overlap by up to MaxOverlapWords to preserve semantic context (especially useful for RAG or embeddings).

Internally, splitting uses regular expression rules for each unit type.

Example:

SELECT * FROM
byom_extended_universe.ChunkText(
ON (SELECT id, txt FROM complaints.complaints_clean)
ON (SELECT tokenizer FROM embeddings_tokenizers WHERE model_id = 'bge-small-en-v1.5') DIMENSION
USING
MaxTokens(25)
MaxOverlapWords(3)
FirstNChunks(1)
SplittingStrategy('WORDS')
OutputVarcharLength(32000)
) a;
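
The WORDS strategy, together with MaxOverlapWords and FirstNChunks, can be pictured with the following Python sketch. It is a simplified illustration under stated assumptions, not the ChunkText implementation: chunk_words is a hypothetical helper, the default count_tokens is a toy stand-in that counts whitespace-separated words instead of calling a real tokenizer, and the way overlap words count toward the token limit is a guess.

```python
def chunk_words(text, max_tokens, max_overlap_words=0, first_n_chunks=0,
                count_tokens=lambda s: len(s.split())):
    """Greedily group words into chunks of at most max_tokens tokens.

    Returns (chunk_number, txt, chunk_length_in_tokens) tuples.
    count_tokens is a toy stand-in for a real tokenizer (assumption).
    """
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        # Always take at least one word, even if it alone is too long.
        end = start + 1
        while (end < len(words)
               and count_tokens(" ".join(words[start:end + 1])) <= max_tokens):
            end += 1
        chunk = " ".join(words[start:end])
        chunks.append((len(chunks), chunk, count_tokens(chunk)))
        if first_n_chunks and len(chunks) == first_n_chunks:
            break  # FirstNChunks reached (0 means "all chunks")
        if end == len(words):
            break  # the whole text has been consumed
        # Carry over up to max_overlap_words words, but always advance
        # by at least one word so the loop terminates.
        start = max(end - max_overlap_words, start + 1)
    return chunks


for number, txt, length in chunk_words("a b c d e f g", max_tokens=3,
                                       max_overlap_words=1):
    print(number, repr(txt), length)
```

With a real tokenizer, count_tokens would return the number of tokens produced for the candidate chunk, which is typically larger than the word count.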

5. NerReplace

NerReplace is a production-ready tool for finding and replacing sensitive information (PII) directly inside Teradata:

Remove PII before sharing data—replace names, addresses, account numbers, and more without moving data outside the database.
Stay compliant and secure—all processing happens in-database, minimizing the risk of exposing private information.
Seamless integration—makes it safe to use third-party analytics or machine learning tools without compromising privacy.

Customizable, auditable, and fast—NerReplace brings privacy-first data workflows directly to your Teradata environment.

NerReplace is a table operator designed to run on the output of the ONNXEmbeddings function.

Inputs:

Data table with the input data. Required columns:
- txt: the input text (VARCHAR or CLOB).
- logits: the output of the model executed with ONNXEmbeddings (BLOB or VARBYTE).
- Any other columns are passed through to the output.
Tokenizer table: One row, one column named tokenizer of type BLOB, passed with the DIMENSION keyword. The BLOB should contain the contents of tokenizer.json from HuggingFace for the model used in ONNXEmbeddings (the same as the third table input of ONNXEmbeddings).
Model config table: One row, one column named config of type BLOB, passed with the DIMENSION keyword. The BLOB should contain the contents of config.json from HuggingFace for the model used in ONNXEmbeddings.

Outputs:

All columns from the input table. In the txt column, the original text is replaced with the processed text.
logits: copied to the output only if the KeepLogits parameter is 'true'.
replaced_entities (VARCHAR): details of the replaced entities. Present only if OutputDetails is 'true'.

Parameters:

OutputDetails
- Required: Optional
- Type/Allowed Values: "true" / "false"
- Default: false
- Description: If "true", adds a column named replaced_entities with details (JSON) about each replaced entity: begin, end, text, entity (the label), and score.
  Example: {"begin":109,"end":130,"text":"Quantum Analytics LLC","entity":"COMPANYNAME","score":0.9395}
EntitiesToReplace
- Required: Optional
- Type/Allowed Values: List of entity labels (e.g. 'EMAIL', 'SSN', …)
- Default: all
- Description: Restricts replacements to these entity labels. With NONE aggregation, labels keep their BIO prefixes (e.g. B-EMAIL). Raises an error if a label does not exist for the model.
AggregationStrategy
- Required: Optional
- Type/Allowed Values: "NONE", "SIMPLE", "AVERAGE", "FIRST", "MAX"
- Default: SIMPLE
- Description: How tokens are grouped into entities:
  - NONE: per-token
  - SIMPLE: group by word
  - FIRST: use first token
  - AVERAGE: average logits
  - MAX: max logit
    (AVERAGE, FIRST, MAX require a word-aware tokenizer.)
ReplaceWithEntityName
- Required: Optional
- Type/Allowed Values: "true" / "false"
- Default: true
- Description: If "true", replaces entities with their label (e.g. 'EMAIL'). If "false", uses ReplacementText parameter.
ReplacementText
- Required: Optional
- Type/Allowed Values: Any string
- Default: (none)
- Description: Text to replace entities with (e.g. "[REDACTED]"). Must set ReplaceWithEntityName='false' to use.
KeepLogits
- Required: Optional
- Type/Allowed Values: "true" / "false"
- Default: false
- Description: If "true", copies the logits column to output.
ReplacedEntiyInfoLength
- Required: Optional
- Type/Allowed Values: Integer (1–32000)
- Default: 6000
- Description: Length limit (characters) for the replaced_entities column (JSON). Only applies if OutputDetails='true'.
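
How ReplaceWithEntityName and ReplacementText interact can be sketched in Python using the JSON shape shown for OutputDetails. This is a hedged illustration, not the operator's code: replace_entities is a hypothetical helper, the sample text and its offsets are invented, and begin/end are assumed to be character offsets into the original text with an exclusive end (which matches the documented example, where end minus begin equals the length of the entity text).

```python
import json


def replace_entities(text, entities, replace_with_entity_name=True,
                     replacement_text="[REDACTED]"):
    """Replace each detected span with its label or a fixed string."""
    # Process spans from the end of the text backwards so that the
    # offsets of earlier spans stay valid after each substitution.
    for ent in sorted(entities, key=lambda e: e["begin"], reverse=True):
        substitute = ent["entity"] if replace_with_entity_name else replacement_text
        text = text[:ent["begin"]] + substitute + text[ent["end"]:]
    return text


# Invented sample text; the entity record follows the documented JSON keys.
text = "Contact Quantum Analytics LLC today."
details = json.loads(
    '{"begin":8,"end":29,"text":"Quantum Analytics LLC",'
    '"entity":"COMPANYNAME","score":0.9395}'
)
print(replace_entities(text, [details]))
print(replace_entities(text, [details], replace_with_entity_name=False))
```

The first call mirrors the default behavior (ReplaceWithEntityName 'true'); the second mirrors ReplaceWithEntityName('false') combined with ReplacementText('[REDACTED]').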

Aggregation Strategies (parameter `AggregationStrategy`):

NONE: No grouping—each token is an entity.
SIMPLE: Groups tokens into words using tokenizer's word-ids, assigns entity by max score.
FIRST: Entity label from the first token of each word. (Requires word-aware tokenizer)
AVERAGE: Entity label by averaging logits for all tokens in a word, then softmax. (Requires word-aware tokenizer)
MAX: Entity label by maximum logit over all tokens in a word. (Requires word-aware tokenizer)

Example:

SELECT
*
FROM byom_extended_universe.NerReplace(
ON (SELECT id, txt AS orig_txt, txt, logits FROM sasha.ner_input_distilbert_finetuned_ai4privacy_v2)
ON (SELECT model AS tokenizer FROM sasha.ner_tokenizers WHERE model_id = 'distilbert_finetuned_ai4privacy_v2') DIMENSION
ON (SELECT model AS config FROM sasha.ner_model_configurations WHERE model_id = 'distilbert_finetuned_ai4privacy_v2') DIMENSION
USING
AggregationStrategy('AVERAGE')
OutputDetails('True')
EntitiesToReplace('FIRSTNAME', 'LASTNAME', 'SSN', 'DOB')
) a;
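
The word-level aggregation strategies (FIRST, AVERAGE, MAX) can be compared with a small Python sketch. It reflects one reading of the descriptions above and is not taken from the NerReplace source: aggregate_word and the label list are hypothetical, and MAX is interpreted here as "use the token that holds the single largest logit", which is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one logit vector."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_word(token_logits, labels, strategy):
    """Pick one (label, score) for a word from its per-token logits.

    token_logits holds one logit vector per token of a single word.
    """
    if strategy == "FIRST":
        scores = softmax(token_logits[0])
    elif strategy == "AVERAGE":
        count = len(token_logits)
        mean = [sum(column) / count for column in zip(*token_logits)]
        scores = softmax(mean)
    elif strategy == "MAX":
        # Assumption: choose the token containing the largest logit.
        scores = softmax(max(token_logits, key=max))
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    index = max(range(len(scores)), key=lambda i: scores[i])
    return labels[index], scores[index]


# Hypothetical labels and logits for one word split into two tokens.
labels = ["O", "FIRSTNAME", "LASTNAME"]
word = [[0.2, 2.0, 0.1], [4.0, 0.5, 0.3]]
for strategy in ("FIRST", "AVERAGE", "MAX"):
    print(strategy, aggregate_word(word, labels, strategy))
```

In this invented example FIRST labels the word FIRSTNAME while AVERAGE and MAX label it O, which shows why the choice of strategy can change which spans are replaced.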

🛠️ Installation

Prerequisite: BYOM version 6.0 or newer must be installed on your Teradata system.

Installation steps:

Create the database and grant permissions:

CREATE DATABASE byom_extended_universe AS PERM = <50000000 * NUMBER OF AMPs IN A SYSTEM>;
GRANT CREATE EXTERNAL PROCEDURE ON byom_extended_universe TO dbc;
GRANT CREATE FUNCTION ON byom_extended_universe TO dbc;

Switch to the database and install the JAR file:

DATABASE byom_extended_universe;
CALL SQLJ.INSTALL_JAR('cj!<PATH TO JAR FILE>', 'BYOM_EU', 0);

Create the functions:

REPLACE FUNCTION byom_extended_universe.ArgMax()
    RETURNS TABLE VARYING USING FUNCTION ArgMax_contract
    LANGUAGE JAVA
    NO SQL
    PARAMETER STYLE SQLTable
    EXTERNAL NAME 'BYOM_EU:com.teradata.byom.extended.universe.vector.ops.ArgMax.execute()';

REPLACE FUNCTION byom_extended_universe.SoftMax()
    RETURNS TABLE VARYING USING FUNCTION SoftMax_contract
    LANGUAGE JAVA
    NO SQL
    PARAMETER STYLE SQLTable
    EXTERNAL NAME 'BYOM_EU:com.teradata.byom.extended.universe.vector.ops.SoftMax.execute()';

REPLACE FUNCTION byom_extended_universe.LengthInTokens()
    RETURNS TABLE VARYING USING FUNCTION LengthInTokens_contract
    LANGUAGE JAVA
    NO SQL
    PARAMETER STYLE SQLTable
    EXTERNAL NAME 'BYOM_EU:com.teradata.byom.extended.universe.nlp.utils.LengthInTokens.execute()';

REPLACE FUNCTION byom_extended_universe.ChunkText()
    RETURNS TABLE VARYING USING FUNCTION ChunkerTO_contract
    LANGUAGE JAVA
    NO SQL
    PARAMETER STYLE SQLTable
    EXTERNAL NAME 'BYOM_EU:com.teradata.byom.extended.universe.chunking.ChunkerTO.execute()';

REPLACE FUNCTION byom_extended_universe.NerReplace()
    RETURNS TABLE VARYING USING FUNCTION ReplaceNerTO_contract
    LANGUAGE JAVA
    NO SQL
    PARAMETER STYLE SQLTable
    EXTERNAL NAME 'BYOM_EU:com.teradata.byom.extended.universe.ner.ReplaceNerTO.execute()';

Grant execution rights:

GRANT EXECUTE FUNCTION ON byom_extended_universe TO <DESIRED USER/ROLE>;

