In my previous posts on missing functions I covered SQL:2008 DENSE_RANK and CUME_DIST & NTILE.
This time I'll talk about additional functions based on sorted sets of values: SQL:2008's PERCENTILE_DISC and PERCENTILE_CONT, plus a MEDIAN calculation.

PERCENTILE_DISC

The CUME_DIST function computes the cumulative distribution of a set of values. The inverse operation, finding the value for a given percentile, is done with PERCENTILE_DISC.

According to SQL:2008, PERCENTILE_DISC(x) is the first value with a CUME_DIST greater than or equal to x.
This directly translates into the row where

ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
= CEILING(COUNT(*) OVER (PARTITION BY part_col) * x)

There's no CEILING function prior to TD13.10, but this returns the same value:

CAST(COUNT(*) OVER (PARTITION BY part_col) * x + 0.999999999999999 AS INT)
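
As a quick sanity check (my own illustration, not part of the original post), the workaround matches CEILING both when the product is fractional and when it is already a whole number:

-- 10 rows, x = 0.25: 2.5 + 0.999... = 3.499..., truncated to 3 = CEILING(2.5)
-- 10 rows, x = 0.30: 3.0 + 0.999... = 3.999..., truncated to 3 = CEILING(3.0)
SELECT
  CAST(10 * 0.25 + 0.999999999999999 AS INT) AS ceil_of_2_5
  ,CAST(10 * 0.30 + 0.999999999999999 AS INT) AS ceil_of_3_0;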

Caution:
This calculation does not work for x = 0, but you'd better use good old MIN instead of PERCENTILE_DISC(0) anyway.

-- SQL:2008 PERCENTILE_DISC aggregate function
SELECT
  part_col
  ,PERCENTILE_DISC(x) WITHIN GROUP (ORDER BY data_col) --aggregate function
FROM tab
GROUP BY 
  part_col;

-- Teradata equivalent
SELECT 
  part_col
  ,data_col
FROM tab
QUALIFY  ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
  = CAST(x * COUNT(*) OVER (PARTITION BY part_col) + 0.999999999999999 AS INT);
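
For a concrete example (my own, using a hypothetical table sales(region, amount)), the discrete 90th percentile of amount per region would be:

-- Hypothetical example: discrete 90th percentile per region
SELECT
  region
  ,amount
FROM sales
QUALIFY ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount)
  = CAST(0.9 * COUNT(*) OVER (PARTITION BY region) + 0.999999999999999 AS INT);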

PERCENTILE_DISC is also defined as an OLAP function by adding an OVER clause. Emulating this requires moving the QUALIFY condition into a CASE expression, so that only the matching row carries a non-NULL value, plus another nested OLAP function (MIN) to propagate that single value to every row of the partition:

-- SQL:2008 PERCENTILE_DISC OLAP function
SELECT
  part_col
 ,data_col
 ,PERCENTILE_DISC(x) WITHIN GROUP (ORDER BY data_col) OVER (PARTITION BY part_col)
FROM tab

-- Teradata equivalent
SELECT
  part_col
  ,data_col
  ,MIN(pc) OVER (PARTITION BY part_col) AS PERCENTILE_DISC
FROM
 (
  SELECT
    part_col
    ,data_col
    ,CASE
      WHEN ROW_NUMBER () OVER (PARTITION BY part_col ORDER BY data_col)
         = CAST(COUNT(*) OVER (PARTITION BY part_col) * x + 0.999999999999999 AS INT)
      THEN data_col
    END AS pc
  FROM tab
 ) AS dt
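
Again assuming the hypothetical sales(region, amount) table, attaching the discrete 90th percentile to every detail row would look like this:

-- Hypothetical example: discrete 90th percentile as an additional column
SELECT
  region
  ,amount
  ,MIN(pc) OVER (PARTITION BY region) AS percentile_disc_90
FROM
 (
  SELECT
    region
    ,amount
    ,CASE
       WHEN ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount)
          = CAST(COUNT(*) OVER (PARTITION BY region) * 0.9 + 0.999999999999999 AS INT)
       THEN amount
     END AS pc
  FROM sales
 ) AS dt;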

PERCENTILE_CONT

Whereas PERCENTILE_DISC returns one of the existing values, PERCENTILE_CONT is based on a linear interpolation between two consecutive values. First you calculate a hypothetical row number rn = (x * (n - 1)) + 1, where x is the percentile and n is the number of rows per group. If rn has no fractional part you have already found the result row; otherwise you also have to consider the following row and interpolate between the two values using the fractional part of rn:

-- SQL:2008 PERCENTILE_CONT aggregate function
SELECT
  part_col
 ,PERCENTILE_CONT(x) WITHIN GROUP (ORDER BY data_col)
FROM tab
GROUP BY
  part_col;

-- Teradata equivalent
SELECT
  part_col
  ,data_col
   -- COALESCE guards the edge case of no following row (single-row groups or x = 1), which would otherwise yield NULL
   + (COALESCE(MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col, 0)
       * (((COUNT(*) OVER (PARTITION BY part_col) - 1) * x) MOD 1)) AS percentile_cont
FROM tab
QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
    = CAST((COUNT(*) OVER (PARTITION BY part_col) - 1) * x AS INT) + 1;
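
To see the interpolation at work (my own numbers, not from the post): with the five ordered values 10, 20, 30, 40, 50 and x = 0.4, rn = 0.4 * 4 + 1 = 2.6, so the result is 20 + 0.6 * (30 - 20) = 26. Applied to the hypothetical sales(region, amount) table, the continuous 90th percentile per region would be:

-- Hypothetical example: continuous 90th percentile of amount per region
SELECT
  region
  ,amount
   + (COALESCE(MIN(amount) OVER (PARTITION BY region ORDER BY amount ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - amount, 0)
       * (((COUNT(*) OVER (PARTITION BY region) - 1) * 0.9) MOD 1)) AS percentile_cont_90
FROM sales
QUALIFY ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount)
    = CAST((COUNT(*) OVER (PARTITION BY region) - 1) * 0.9 AS INT) + 1;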

-- SQL:2008 PERCENTILE_CONT OLAP function
SELECT
  part_col,
  data_col,
  PERCENTILE_CONT(x) WITHIN GROUP (ORDER BY data_col)
  OVER (PARTITION BY part_col)
FROM tab

-- Teradata equivalent
SELECT
  part_col
  ,data_col
  ,MIN(pc) OVER (PARTITION BY part_col) AS PERCENTILE_CONT
FROM
 (
  SELECT
    part_col
    ,data_col
    ,COUNT(*)      OVER (PARTITION BY part_col) - 1 AS N
    ,ROW_NUMBER()  OVER (PARTITION BY part_col ORDER BY data_col) - 1 AS rowno
    ,COALESCE(MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col, 0) AS diff -- 0 instead of NULL when there is no following row
    ,CASE
       WHEN rowno = CAST(N * x AS INT)
       THEN data_col +(((N * x) MOD 1) * diff)
     END AS pc
  FROM tab
 ) AS dt

The CASE technique can also be used to calculate multiple percentiles in a single query, e.g. the Quartiles as aggregate functions:

SELECT part_col
  ,MIN(pc25) OVER (PARTITION BY part_col) AS quartile_1
  ,MIN(pc50) OVER (PARTITION BY part_col) AS quartile_2
  ,MIN(pc75) OVER (PARTITION BY part_col) AS quartile_3
FROM
 (
  SELECT
    part_col
    ,COUNT(*)      OVER (PARTITION BY part_col) - 1 AS N
    ,ROW_NUMBER()  OVER (PARTITION BY part_col ORDER BY data_col) - 1 AS rowno
    ,COALESCE(MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col, 0) AS diff -- 0 instead of NULL when there is no following row

    ,CASE
       WHEN rowno = CAST(N * 0.25 AS INT)
       THEN data_col +(((N * 0.25) MOD 1) * diff)
     END AS pc25
 
    ,CASE
       WHEN rowno = CAST(N * 0.50 AS INT)
       THEN data_col +(((N * 0.50) MOD 1) * diff)
     END AS pc50
 
    ,CASE
       WHEN rowno = CAST(N * 0.75 AS INT)
       THEN data_col +(((N * 0.75) MOD 1) * diff)
     END AS pc75
  FROM tab
  QUALIFY   rowno = CAST(N * 0.25 AS INT)
       OR   rowno = CAST(N * 0.50 AS INT)
       OR   rowno = CAST(N * 0.75 AS INT)
 ) AS dt
QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY part_col) = 1

Omitting both QUALIFY clauses will return the quartiles as OLAP functions.
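
As a sanity check with my own numbers (not from the original post): for a group containing 10, 20, 30, 40, 50 the query returns quartile_1 = 20, quartile_2 = 30 and quartile_3 = 40, the same values Excel's QUARTILE function produces.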

MEDIAN

The MEDIAN is just the special case of the 50th percentile.
In SQL:2008 there's no reason to define an extra function for that, but the formula above can be simplified for a median calculation:
If the number of rows N is odd, the median value is found in row (N/2) + 1 (integer division); if N is even, it's the average of this row's and the previous row's value:

-- MEDIAN aggregate function
SELECT
 part_col
 ,MEDIAN(data_col)
FROM tab
GROUP BY part_col;

-- Teradata equivalent
SELECT
  part_col
  ,CASE
     WHEN      COUNT(*) OVER (PARTITION BY part_col) MOD 2 = 1 THEN data_col
     ELSE AVG(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS 1 PRECEDING)
   END
FROM tab
QUALIFY
  ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
    = COUNT(*) OVER (PARTITION BY part_col) / 2 + 1;
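
A quick check with my own numbers (not from the post): for 10, 20, 30 the QUALIFY picks row 3/2 + 1 = 2 and returns 20; for 10, 20, 30, 40 it picks row 3 and averages rows 2 and 3, giving (20 + 30) / 2 = 25. Applied to the hypothetical sales(region, amount) table:

-- Hypothetical example: median amount per region
SELECT
  region
  ,CASE
     WHEN      COUNT(*) OVER (PARTITION BY region) MOD 2 = 1 THEN amount
     ELSE AVG(amount) OVER (PARTITION BY region ORDER BY amount ROWS 1 PRECEDING)
   END AS median_amount
FROM sales
QUALIFY ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount)
    = COUNT(*) OVER (PARTITION BY region) / 2 + 1;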

-- MEDIAN OLAP function
SELECT
  part_col
  ,MEDIAN(data_col) OVER (PARTITION BY part_col)
FROM tab;

-- Teradata equivalent
SELECT
  part_col
  ,MIN(med) OVER (PARTITION BY part_col) AS "MEDIAN"
FROM
 (
  SELECT
    part_col
    ,CASE
       WHEN ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
              = COUNT(*) OVER (PARTITION BY part_col) / 2 + 1
       THEN
         CASE
           WHEN COUNT(*) OVER (PARTITION BY part_col) MOD 2 = 1 THEN data_col
           ELSE AVG(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS 1 PRECEDING)
         END
     END AS med
  FROM tab
 ) AS dt

This MEDIAN calculation returns the same result as the corresponding function in Excel, and PERCENTILE_CONT reproduces Excel's PERCENTILE function.

Discussion
DiEgoR 10 comments Joined 08/06
11 Jul 2011

A year or so ago I was working on the same topic and came up with the following implementations. The mathematical definition used here must be a bit different from the SQL standard, but perhaps it is worth commenting on.

/*** "financial median": Group COUNT + ROW_NUMBER ***/
SELECT
dt1.part_col
,AVG(data_col)
FROM
(SELECT
part_col
,data_col
,ROW_NUMBER() OVER (PARTITION BY part_col
ORDER BY data_col) AS row_num
,COUNT(*) OVER (PARTITION BY part_col
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS row_count
FROM tab
) AS dt1
WHERE
row_num = (row_count + 1) / 2
OR
row_num = (row_count / 2) + 1
GROUP BY part_col
ORDER BY part_col
;

/*** "statistical median": Group COUNT + ROW_NUMBER ***/
SELECT
dt1.part_col
,data_col
FROM
(SELECT
part_col
,data_col
,ROW_NUMBER() OVER (PARTITION BY part_col
ORDER BY data_col) AS row_num
,COUNT(*) OVER (PARTITION BY part_col
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS row_count
FROM tab
) AS dt1
WHERE
row_num = (row_count + 1) / 2 --left (lesser) value
-- row_num = (row_count / 2) + 1 --right (greater) value
ORDER BY part_col
;

dnoeth 86 comments Joined 11/04
17 Jul 2011

Hi Egor,
I know those queries, I wrote them a few years ago :-)

The 1st is exactly the same as the SQL:2008 calculation, I only changed the AVG calculation from aggregate to OLAP.

Dieter

Angellore 1 comment Joined 06/10
11 Aug 2011

Hi. Is it possible to write OLTP UDF in Teradata? Functions are good, but it is not convenient to use them...

dnoeth 86 comments Joined 11/04
11 Aug 2011

Did you mean OLAP UDF?
Yes, you can write your own OLAP functions (in C) in TD13.10, there's a DENSE_RANK example in the "SQL External Routine Programming" manual.

But it's probably quite hard to do so unless you are an experienced C-programmer and find a good algorithm to implement...

Dieter

kattamadhu 4 comments Joined 02/11
10 Nov 2011

Hi dnoeth, can you please help with this question?

I am working on the TD13 trial version.

CREATE SET TABLE tduser.jn1_emp, NO FALLBACK,
  NO BEFORE JOURNAL,
  NO AFTER JOURNAL,
  CHECKSUM = DEFAULT
 (
  emp_no INTEGER,
  emp_loc VARCHAR(12)
 )
UNIQUE PRIMARY INDEX (emp_no);

INSERT INTO tduser.jn1_emp (1,'hyd');
INSERT INTO tduser.jn1_emp (2,'bang');
INSERT INTO tduser.jn1_emp (3,'visak');

COLLECT STATS ON tduser.jn1_emp INDEX (emp_no);

CREATE SET TABLE tduser.jn2_emp, NO FALLBACK,
  NO BEFORE JOURNAL,
  NO AFTER JOURNAL,
  CHECKSUM = DEFAULT
 (
  pme_no INTEGER,
  emp_name VARCHAR(12)
 )
UNIQUE PRIMARY INDEX (pme_no);

INSERT INTO tduser.jn2_emp (1,'raj');
INSERT INTO tduser.jn2_emp (2,'ravi');
INSERT INTO tduser.jn2_emp (4,'kishore');

COLLECT STATS ON tduser.jn2_emp INDEX (pme_no);

When I explain the following join, the plan shows "low confidence". Can anybody suggest how to get "high confidence"?

EXPLAIN SELECT * FROM tduser.jn1_emp, tduser.jn2_emp
WHERE emp_no = pme_no;

dnoeth 86 comments Joined 11/04
30 Nov 2011

I don't think you will be able to get high confidence for this join.

Adding a foreign key might help, but why do you actually want it?
The estimated number of rows would probably not change.

Btw, you should post questions like this as a new topic in an appropriate forum.

Dieter

med 1 comment Joined 01/12
21 Jan 2012

Hello, I need some help please.

I have a dataset similar to the following:

Patient_ID Unit Start_Time End_time
1234 3FE 3/2/09 12:00 3/5/09 13:00
1234 3FE 5/5/09 13:00 5/5/09 14:00
1234 3FE 5/5/09 14:00 5/7/09 18:00
1234 ICU 5/7/09 18:00 5/16/09 5:00
1234 3FE 5/16/09 5:00 5/23/09 7:00
1234 3FE 5/23/09 7:00 5/25/09 3:00
1234 3FE 5/25/09 3:00 5/28/09 8:00

The result set should look like the following:

Patient_ID Unit Start_Time End_time
1234 3FE 3/2/09 13:00 5/7/09 18:00
1234 ICU 5/7/09 18:00 5/16/09 5:00
1234 3FE 5/16/09 5:00 5/28/09 8:00

Any suggestions would be appreciated.

dnoeth 86 comments Joined 11/04
21 Jan 2012

Answered at
http://forums.teradata.com/forum/database/qualify-rank-over-partition-question#comment-18797

Dieter

MI181902 3 comments Joined 03/11
15 Nov 2013

Hi, I'm an SA at Teradata Japan.
Please help me.
I am looking for a median UDF (the environment where I want to implement it is Teradata 14.0).
If you have such a function, please send it to me.
I know that a MEDIAN function is implemented from Teradata 14.10.
