All Forums General
TD_Raj 50 posts Joined 05/10
16 Feb 2012
Which is better GROUP BY, DISTINCT or SET table

I want to know which one is better to handle duplicate records ; is it using GROUP BY or DISTINCT while selecting data from Source table or Creating Target table as SET table ?

Thanks in advance.

Nitin 'Raj' Srivastava




ulrich 816 posts Joined 09/09
16 Feb 2012

classical answer for an unspefic question ;-> - it depends.

Can you share more information on the process. How many rows, how many dups etc. Where do the dups come from?

But even if you give this info - the answer it will be difficult to judge as the system configuration can also make a difference. So running some tests on your environment will give you best answers.

Dedup can become an expensive operation should be done only once and as early as possible. Lassy design can result in a lot of group by and distinct usage and will consume huge amount of resources...

feel free to donate bitcoin:12kgAUHFUqvG2sQgaRBXFhCwyf9HXdkGud

Stefans 38 posts Joined 02/12
22 Feb 2012

It again depends on the number of rows per value present in the table.

GROUPBY can be used if the table contains more duplicate rows per value and DISTINCT incase of less duplicate rows per value.



VasuKillada 31 posts Joined 10/11
25 Feb 2012

Creating a SET table in the target makes the duplicate check for the entire row. Using GROUP BY or DISTINCT in the selecting data for your source tables does the check for the columns specified in your GROUP BY or DISTINCT clause. Both the plans for GROUP BY and DISTINCT are similar but it does make a difference based on the data set.


marcel.toledo 1 post Joined 02/12
27 Feb 2012

I think that SET table are useful to garantee the integrity when you know that you gonna have a poor ETL development. Considering only the performance objective, it depends. You should test your case.

medvacflights 1 post Joined 02/12
27 Feb 2012

Is it? Well, I might as well try...

asadali.khan 11 posts Joined 10/10
29 Feb 2012

Performance varies depending on the data.


is better when the data is nearly unique. actually when distinct is used; the intermediate spool is sorted and discards the duplicates.


is better when data has relatively few unique values and works by performing an AMP local grouping operation and then merging the partial result sets for final processing.

Asad Ali Khan

ulrich 816 posts Joined 09/09
29 Feb 2012

also check 13.0 release summary (see attachement).

Optimzer is getting smarter and sometimes distinct and group by result in the same plan.

feel free to donate bitcoin:12kgAUHFUqvG2sQgaRBXFhCwyf9HXdkGud

You must sign in to leave a comment.