

Missing Functions: PERCENTILE_DISC, PERCENTILE_CONT & MEDIAN
In my previous posts on missing functions i covered SQL:2008 DENSE_RANK and CUME_DIST & NTILE.
This time i'll talk about additional functions based on sorted sets of values, SQL:2008's PERCENTILE_DISC and PERCENTILE_CONT plus a MEDIAN calculation.
PERCENTILE_DISC
The CUME_DIST function computes the cumulative distribution of a set of values. The inverse operation to find the value for a given percentile is done using PERCENTILE_DISC.
According to SQL:2008 PERCENTILE_DISC(x) is the first value with a CUME_DIST greater than or equal to x.
This directly translates to the row with a
ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY order_col) = CEILING(COUNT(*) OVER (PARTITION BY part_col) * x
There's no CEILING function prior to TD13.10, but this returns the same value:
CAST(COUNT(*) OVER (PARTITION BY part_col) * x + 0.999999999999999 AS INT)
Caution:
This calculation will not work for x = 0, but you better use good old MIN instead of PERCENTILE_DISC(0).
-- SQL:2008 PERCENTILE_DISC aggregate function SELECT part_col ,PERCENTILE_DISC(x) WITHIN GROUP (ORDER BY data_col) --aggregate function FROM tab GROUP BY part_col; -- Teradata equivalent SELECT part_col ,data_col FROM tab QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col) = CAST(x * COUNT(*) OVER (PARTITION BY part_col) + 0.999999999999999 AS INT);
PERCENTILE_DISC is also defined as OLAP function adding the OVER clause. This requires moving the QUALIFY-condition into a CASE and another nested OLAP function:
-- SQL:2008 PERCENTILE_DISC OLAP function SELECT part_col ,data_col ,PERCENTILE_DISC(x) WITHIN GROUP (ORDER BY data_col) OVER (PARTITION BY part_col) FROM tab -- Teradata equivalent SELECT part_col ,data_col ,MIN(pc) OVER (PARTITION BY part_col) AS PERCENTILE_DISC FROM ( SELECT part_col ,data_col ,CASE WHEN ROW_NUMBER () OVER (PARTITION BY part_col ORDER BY data_col) = CAST(COUNT(*) OVER (PARTITION BY part_col) * x + 0.999999999999999 AS INT) THEN data_col END AS pc FROM tab ) AS dt
PERCENTILE_CONT
Whereas PERCENTILE_DISC returns one of the existing values, PERCENTILE_CONT is based on a linear interpolation between two consecutive values. First you calculate a hypothetical row number rn = (x * (n - 1)) + 1, where x is the percentile and n is the number of rows per group. If rn has no fractional part then you already found the result row, else you have to consider the following row, too:
-- SQL:2008 PERCENTILE_CONT aggregate function SELECT part_col ,PERCENTILE_CONT(x) WITHIN GROUP (ORDER BY order_col) FROM tab GROUP BY part_col; -- Teradata equivalent SELECT part_col ,data_col + ((MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col) * (((COUNT(*) OVER (PARTITION BY part_col) - 1) * x) MOD 1)) AS percentile_cont FROM tab QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col) = CAST((COUNT(*) OVER (PARTITION BY part_col) - 1) * x AS INT) + 1; -- SQL:2008 PERCENTILE_CONT OLAP function SELECT part_col, data_col, PERCENTILE_CONT(x) WITHIN GROUP (ORDER BY data_col) OVER (PARTITION BY part_col) FROM tab -- Teradata equivalent SELECT part_col ,data_col ,MIN(pc) OVER (PARTITION BY part_col) AS PERCENTILE_CONT FROM ( SELECT part_col ,data_col ,COUNT(*) OVER (PARTITION BY part_col) - 1 AS N ,ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col) - 1 AS rowno ,MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col AS diff ,CASE WHEN rowno = CAST(N * x AS INT) THEN data_col +(((N * x) MOD 1) * diff) END AS pc FROM tab ) AS dt
The CASE technique can also be used to calculate multiple percentiles in a single query, e.g. the Quartiles as aggregate functions:
SELECT part_col ,MIN(pc25) OVER (PARTITION BY part_col) AS quartile_1 ,MIN(pc50) OVER (PARTITION BY part_col) AS quartile_2 ,MIN(pc75) OVER (PARTITION BY part_col) AS quartile_3 FROM ( SELECT part_col ,COUNT(*) OVER (PARTITION BY part_col) - 1 AS N ,ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col) - 1 AS rowno ,MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col AS diff ,CASE WHEN rowno = CAST(N * 0.25 AS INT) THEN data_col +(((N * 0.25) MOD 1) * diff) END AS pc25 ,CASE WHEN rowno = CAST(N * 0.50 AS INT) THEN data_col +(((N * 0.50) MOD 1) * diff) END AS pc50 ,CASE WHEN rowno = CAST(N * 0.75 AS INT) THEN data_col +(((N * 0.75) MOD 1) * diff) END AS pc75 FROM tab QUALIFY rowno = CAST(N * 0.25 AS INT) OR rowno = CAST(N * 0.50 AS INT) OR rowno = CAST(N * 0.75 AS INT) ) AS dt QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY part_col) = 1
Omitting both QUALIFY clauses will return the quartiles as OLAP functions.
MEDIAN
The MEDIAN is just the special case of the 50th percentile.
In SQL:2008 there's no reason to define an extra funcion for that, but above formula can be simplified for a median calculation:
If the number of rows N is odd, then the median value is found in row (N/2)+1 else it's the average of this row's and the previous row's value:
-- MEDIAN aggregate function SELECT part_col ,MEDIAN(data_col) FROM tab GROUP BY part_col; -- Teradata equivalent SELECT part_col ,CASE WHEN COUNT(*) OVER (PARTITION BY part_col) MOD 2 = 1 THEN data_col ELSE AVG(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS 1 PRECEDING) END FROM tab QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col) = COUNT(*) OVER (PARTITION BY part_col) / 2 + 1; -- MEDIAN OLAP function SELECT part_col ,MEDIAN(data_col) OVER (PARTITION BY part_col) FROM tab GROUP BY part_col; -- Teradata equivalent SELECT part_col ,MIN(med) OVER (PARTITION BY part_col) AS "MEDIAN" FROM ( SELECT part_col ,CASE WHEN ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col) = COUNT(*) OVER (PARTITION BY part_col) / 2 + 1 THEN CASE WHEN COUNT(*) OVER (PARTITION BY part_col) MOD 2 = 1 THEN data_col ELSE AVG(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS 1 PRECEDING) END END AS med FROM tab ) AS dt
This MEDIAN calculation returns the same result as the correspondent function in Excel and PERCENTILE_CONT reproduces Excel's PERCENTILE function.
A year or so ago I have been working on the same topic and came up with the following implementations. The mathematical definition used here must be a bit different from the SQL standart but perhaps it is worth for you to comment on.
/*** "financial median": Group COUNT + ROW_NUMBER ***/
SELECT
dt1.part_col
,AVG(data_col)
FROM
(SELECT
part_col
,data_col
,ROW_NUMBER() OVER (PARTITION BY part_col
ORDER BY data_col) AS row_num
,COUNT(*) OVER (PARTITION BY part_col
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS row_count
FROM tab
) AS dt1
WHERE
row_num = (row_count + 1) / 2
OR
row_num = (row_count / 2) + 1
GROUP BY part_col
ORDER BY part_col
;
/*** "statistical median": Group COUNT + ROW_NUMBER ***/
SELECT
dt1.part_col
,data_col
FROM
(SELECT
part_col
,data_col
,ROW_NUMBER() OVER (PARTITION BY part_col
ORDER BY data_col) AS row_num
,COUNT(*) OVER (PARTITION BY part_col
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS row_count
FROM tab
) AS dt1
WHERE
row_num = (row_count + 1) / 2 --left (lesser) value
-- row_num = (row_count / 2) + 1 --right (greater) value
ORDER BY part_col
;
input output putput
Hi Egor,
i know those queries, i wrote them a few years ago :-)
The 1st is exactly the same as the SQL:2008 calculation, I only changed the AVG calculation from aggregate to OLAP.
Dieter
Dieter
Hi. Is it possible to write OLTP UDF in Teradata? Functions are good, but it is not convenient to use them...
Did you mean OLAP UDF?
Yes, you can write your own OLAP functions (in C) in TD13.10, there's a DENSE_RANK example in the "SQL External Routine Programming" manual.
But it's probably quite hard to do so unless you are an experienced C-programmer and find a good algorithm to implement...
Dieter
Dieter
Hi dnoeth,can you please help on this
question
I am working on TD13 trial version….
CREATE SET TABLE tduser.jn1_emp ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
emp_no INTEGER,
emp_loc varchar(12))
Unique PRIMARY INDEX ( emp_no );
Insert into tduser.jn1_emp(1,’hyd’);
Insert into tduser.jn1_emp(2,’bang’);
Insert into tduser.jn1_emp(3,’visak’);
Collect stats on tduser.jn1_emp index(emp_no);
CREATE SET TABLE tduser.jn2_emp ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
pme_no INTEGER,
emp_name varchar(12))
Unique PRIMARY INDEX ( pme_no );
Insert into tduser.jn2_emp(1,’raj’);
Insert into tduser.jn2_emp(2,’ravi’);
Insert into tduser.jn2_emp(4,’kishore’);
Collect stats on tduser.jn2_emp index(pme_no);
If I am trying to execute the following it is giving “low confidence” in the explain plan.can anybody suggest how to make it to “high confidence”
Explain sel * from tduser.jn1_emp, tduser.jn2_emp
Where emp_no = pme_no
I don't think you will be able to get high confidence for this join.
Maybe adding a foreign key might help, but why do you actually want it?
The estimated number would probably not change.
Btw, you should post question like this as a new topic in an appropiate forum.
Dieter
Dieter
hello, i need some help please,
I have a dataset similar to the following:
Patient_ID Unit Start_Time End_time
1234 3FE 3/2/09 12:00 3/5/09 13:00
1234 3FE 5/5/09 13:00 5/5/09 14:00
1234 3FE 5/5/09 14:00 5/7/09 18:00
1234 ICU 5/7/09 18:00 5/16/09 5:00
1234 3FE 5/16/09 5:00 5/23/09 7:00
1234 3FE 5/23/09 7:00 5/25/09 3:00
1234 3FE 5/25/09 3:00 5/28/09 8:00
The result set should look like the following:
Patient_ID Unit Start_Time End_time
1234 3FE 3/2/09 13:00 5/7/09 18:00
1234 ICU 5/7/09 18:00 5/16/09 5:00
1234 3FE 5/16/09 5:00 5/28/09 8:00
Any suggestions would be appreciated.
Answered at
http://forums.teradata.com/forum/database/qualify-rank-over-partition-question#comment-18797
Dieter
Dieter
Hi,I'm SA in Japan Teradata associates .
Please help me.
I am looking at the median UDF function.(Environment that I want to implement is Teradata14.0)
Please send it to me, and if you have a function.
-I know that the function of median is implemented from Teradata14.10.
Hi,I'm SA in Japan Teradata associates .
Please help me.
I am looking at the median UDF function.(Environment that I want to implement is Teradata14.0)
Please send it to me, and if you have a function.
-I know that the function of median is implemented from Teradata14.10.
Hi,I'm SA in Japan Teradata associates .
Please help me.
I am looking at the median UDF function.(Environment that I want to implement is Teradata14.0)
Please send it to me, and if you have a function.
-I know that the function of median is implemented from Teradata14.10.