All Forums Analytics
abhpathak 3 posts Joined 09/14
15 Sep 2014
Outlier Treatment in Teradata

Hi guys, 
I have a table at customer level and their corresponding payments data( Two columns: Cust_id and payment)  containing 20 million rows. I am trying to find outliers(0.1% top and bottom) based on payments.One method is using rank function and giving percentile values to each customer , but rank fuction does not work for such a large dataset. Is their any other way to do it ?

Abhinav Pathak, Business analyst, Latentview
dnoeth 4628 posts Joined 11/04
15 Sep 2014

Hi Abhinav,
why do you think that rank doesn't work for a 20 million row dataset? I wouldn't consider this big, I can easily run a similar query on VMWare version...
 

SELECT * FROM cust_table
QUALIFY -- maybe RANK instead of ROW_NUMBER
   ROW_NUMBER() OVER (ORDER BY payment) <= COUNT(*) OVER () * 0.001 
OR   
   ROW_NUMBER() OVER (ORDER BY payment) >= COUNT(*) OVER () * 0.999 

 

or using the deprecated QUANTILE function:

QUALIFY QUANTILE(1000, payment) in (0,999)

 

 

Dieter

abhpathak 3 posts Joined 09/14
16 Sep 2014

Hey
Actually, I have 2 segments across which I have to find outliers. So, code I have used is something like this and it is spooiling out.

		  		  select 
				 cust_id, 
				 offer_name,
				 contact_status,
				 tpv
				 from table_name
				 Qualify Row_number() over(partition by offer_name,contact_status order by tpv) < count(*) * 0.001 
				 group by 1,2,3,4;

 

Abhinav Pathak, Business analyst, Latentview

dnoeth 4628 posts Joined 11/04
16 Sep 2014

Hi Abhinav,
this is not going to work, you must change the count to COUNT(*) OVER (). And you can probably remove the GROUP BY, too.
What are the datatypes of those columns? 

Dieter

abhpathak 3 posts Joined 09/14
17 Sep 2014

Thanks for the quick response
Data types are

      cust_id CHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC,
      CONTACT_STATUS INTEGER,
      OFFER_NAME VARCHAR(130) CHARACTER SET LATIN NOT CASESPECIFIC,
      tpv DECIMAL(18,4)

and even with COUNT(*) over() and removing group by , it is spooling out 

Abhinav Pathak, Business analyst, Latentview

dnoeth 4628 posts Joined 11/04
17 Sep 2014

Hi Abhinav,
seems like you have to talk to your DBA to get a higher spool limit.

Dieter

You must sign in to leave a comment.