All Forums Database
jana.teradata 11 posts Joined 02/10
05 Nov 2013
How Compression Works

Hi All,
 
I have a table with  billion records. I have quite a few char(01) columns in this table which has repeating values like 'Y' , 'N'. I have created a similar table  but with the compression added on all these char(01) columns. I found out the table with compressing all these columns increases the original space.
 
So my question is : is not recommnded to compress the char(01) columns. If so can you please explain why ?
 
Thanks in advance
Jana
 

M.Saeed Khurram 544 posts Joined 09/12
05 Nov 2013

Hi,
When you define compression on some columns in TD, Teradata stores the compressed values in the table header. So if you are compressing a large number of columns, the header size will become large. Thats why you might be getting the the table size larger than actual.
 

Khurram

VBurmist 96 posts Joined 12/09
06 Nov 2013

Hi,
You can get the size of the table header by creating an empty table.  Depends on the number of AMPs.
Apart from the table header, another important reason is the introduction of additional presence bits - in each row of the table.   The more compressed values you enlist, the more presence bits you need to compress those values.
- For those rows where values are compressed, we have to store only the presence bit(s), not the actual values, so the storage space is greatly reduced.   Bits are packed into bytes.   Works like a charm.
- However, if many rows have values that are not compressed, for those rows we store both presence bits and actual value (if not compressed).   This introduces a possibility of a size increase, but only when the compressed values are defined incorrectly.    So, either no compression takes place for many rows, or we spend too many presense bits for values we don't really need.
So, for each column you should verify whether a compressed value is really encountered in a large number of rows.    If a value is quite rare - encountered only in a small number of rows, it might be cheaper to omit it in the compression list.    
Char(1) are usually good candidates for compression.
Regards,
Vlad.

11 Feb 2014

Hi all,
Just want to confirm that as the number of values lead to addtion of presence bits which in turn increases to size of table header. 
Do we need to go for odd number of values or the even number of values per column for the good
space savings?
 
Regards,
Nishant
 

dnoeth 4628 posts Joined 11/04
11 Feb 2014

Hi Nishant,
it's odd, (2**n) -1 :-)

BITS     VALUES
1    ->   1
2    ->   2 - 3
3    ->   4 - 7
4    ->   8 - 15
5    ->  16 - 31
6    ->  32 - 63
7    ->  64 - 127
8    -> 128 - 255

 

Dieter

11 Feb 2014

Ok thanks dieter.
Also just wanted to know ,how much savings we get by compressing NULLS or zero length string (i.e '' ) ?
I believe NULLS or '' also  take   1 byte.
Please correct me if I am wrong?
 
Regards,
Nishant

vlcik 3 posts Joined 01/14
23 Feb 2016

Dieter, Nishant,

 

just to add the formula above differs for nullable and not nullable columns. Don't forget you need to save that extra state for "NULL" which is compressed implicitely for nullable columns. To describe further, I will work with the word STATE:

1) NULLABLE - NULL, (not compressed), ValueA, ValueB - 2 items on compression list, 4 states to be coded -> 2 presence bits required

2) NOT NULLABLE - (not compressed), ValueA, ValueB, ValueC - 3 items on compression list, 4 states to be coded -> 2 presence bits required

 

N presence bits can be used to code up 2^N states. Therefore you need n=[log(2)(k+2)] bits to code (k) nullable columns and n=[log(2)(k+1)] for not nullable...

 

Long story short - the ideal number of items on compression list is:

1) NULLABLE - (0),2,4,8,16... (0 means only compress)

2) NOT NULLABLE - 1,3,7,15...

 

If you need to go deeper, feel free to ask :)

 

Vlcik

dnoeth 4628 posts Joined 11/04
23 Feb 2016

Hi Vlcik,
nope, the NULL is already included :-)
See Number of Presence Bits Required to Represent Compressed Values in the Database Design manual:
http://www.info.teradata.com/HTMLPubs/DB_TTU_15_10/Database_Management/B035_1094_151K/Database_Level_Capacity_Planning.064.031.html#ww19308315

Dieter

vlcik 3 posts Joined 01/14
23 Feb 2016

Dieter,
thanks for the link. I see the information in the documentation contradictory though. As stated in one of the tables, it says:
col_1 CHAR(1) NOT NULL COMPRESS (‘A’) - requires 1 presence bit
col_1 CHAR(1) COMPRESS (‘A’) - requires 2 presence bits
 
Therefore NULL is not included, in fact nullability requires 1 extra state to be "coded" using presence bits and thus requires the extra presence bit in the case above. The only explanation would be that Teradata stores the information about nullability in a extra separate (presence?) bit, which would be very uneffective. The number of states to be represented by presence bits obviously differs - since compressing null values with COMPRESS is implicit and can't be skipped.
If I'm in fact wrong, can't you tell me why there is difference in the case above?

You must sign in to leave a comment.