# Column compress values from statistics

Besides collecting statistics on your columns on your Teradata database, the compressing of the data to save disk space is a very important maintaining task. So why not connect these two tasks? The idea is to extract the values for the multi value compression of the columns out of the collected statistics.

### The idea

Starting with Teradata V14 the "SHOW STATISTICS VALUES COLUMN col ON db.tab; " prints out as a text (optionally as XML) the results of the last collection of statistics in detail. The output in text form is exactly the command to insert the results of the collection back into the database. The command prints a lot of lines. The following are interesting for the algorithm:

...

/* NumOfNulls */ 20,

...

...

/* NumOfRows */ 3180,

...

...

/** Biased: Value, Frequency **/

/* 1 */ 'N', 3147,

/* 2 */ 'Y', 13

...

/* 1 */ 'N', 3147,

/* 2 */ 'Y', 13

...

Specially the biased values block show the values of the column, which are very often in the data. And these values can be taken for compressing of the column.

The column for compression has to have the following requirements:

- Statistic has to be representative and actual, but could be sampled
- Column is not allowed to be part of index or partition
- The statistics values must have the correct length
- It is not allowed to have statistics on the column during the alter table statement

In Teradata 14 all statistics values are limited to 26 characters. To get the not trimmed values you have to use the "USING MAXVALUELENGTH" clause during the collect statistics command.

The other fact disturb the algorithm more: You cannot change a column when there is an statistic on it.

The advantages are:

#### First Results and Motivation

As a teradata customer we run a Appliance instance with about 10 TB of user data. In a few hours running these scripts we decreased our space by 20%.

Unfortunately this is the only instance I can test the scripts at the moment, so further improvements and remarks are very welcomed.

Last, but not least, thanks to Dieter Nöth (dnoeth) for the tipps.

Discussion

This easy solution for fitting on one page has some disadvantages:

## The algorithm

First we execute for each column with statistics of the table to compress the "SHOW STATISTICS VALUES COLUMN". From this output we take the numbers of null and the values of the biased values block. From the number of occurences we decide which values come into the multi value compress list. At the moment each value has to have an estimation of more than 1% in the data. With this limit it could not happen that we have more than 100 compress values. In parallel we create a "DROP STATISTICS" and the "COLLECT STATISTICS COLUMN ... ON ... VALUES (...);" to put the statistics back. With this three files we first drop the statistics, perform the alter table statement and after that put the statistics back.

## The process

The algorithm consists of a sql file and an awk script. The sql file gets the "SHOW STATISTICS VALUES COLUMN" for the columns for the tables in an useful ordering:

These commands have to be executed by bteq and stored in one file. The awk script takes this file and produces a larger file:

Executing these statements perform the compression. Finished.

## The source code

SQL File

AWK File