All Forums Database
tddev1984 9 posts Joined 09/13
12 Sep 2013
Selecting duplicate records from a table - Complete record and not just the duplicate keys

Hi,
 
I have a situation where i have duplicate records in my table. This is determined based on a combination of a few natural/business keys.
I know the way to identify if there are duplicates using the below query :
SELECT COL1,COL2,COL3,COUNT(*)
FROM TABLE
GROUP BY COL1,COL2,COL3
HAVING COUNT(*)>1
Lets assume that this returns 2.6 million records.
My requirement is to retrieve the complete records from the table that are duplicated and NOT JUST the duplicate keys.
I coded the below SQL :
SELECT A.COL1,A.COL2,A.COL3,A.COL4,A.COL5
FROM
(
SELECT COL1,COL2,COL3,COL4,COL5
FROM TABLE
) A
INNER JOIN
(
SELECT DISTINCT COL1,COL2,COL3
FROM TABLE
GROUP BY COL1,COL2,COL3
HAVING COUNT(*)>1
) B
ON
A.COL1=B.COL1 AND
A.COL2=B.COL2 AND
A.COL3=B.COL3
This SQL is returining only 2.9 million records.  When i get 2.6 million records having more than 1 record for a key combination, when i join it with the table i should get a minimum of 2.6 Million *2 (assuming there are 2 duplicate keys) or more. But it can't be less.
I understand that I am not getting the desired results. Could you please share your thoughts if the above approach is correct? and if so, could you please share your thoughts on not getting the desired output?

M.Saeed Khurram 544 posts Joined 09/12
12 Sep 2013

Try the following query:

SELECT *
FROM TableName
WHERE (col1,col2,col3) IN (SELECT col1,col2,col3 
                             FROM TABLENAME 
                             GROUP BY 1,2,3 
                             HAVING COUNT(*)>1);

 

Khurram

dnoeth 4628 posts Joined 11/04
12 Sep 2013

You're correct, there should be at least twice the number of rows returned. Your code is overly complex, you don't need the DISTINCT and the Derived Table A, but this will not change the result:

SELECT A.*
FROM TABLE A
INNER JOIN
 (
   SELECT COL1,COL2,COL3
   FROM TABLE
   GROUP BY COL1,COL2,COL3
   HAVING COUNT(*)>1
 ) B
ON
A.COL1=B.COL1 AND
A.COL2=B.COL2 AND
A.COL3=B.COL3

 
There must be something else, could you post your actual query?
 
Depending on the number of columns in your table and the number of duplicates there is another approach:

SELECT *
FROM TABLE
QULAIFY
  COUNT(*) OVER (PARTITION BY COL1,COL2,COL3) > 1

 
Dieter

Dieter

tddev1984 9 posts Joined 09/13
12 Sep 2013

Hi ,
 
Thanks for your response. Yes, the query is a little overstated and thanks for simplifying it.
Please find the complete query that i was running below. I have  total of 15 columns and the duplicate keys are 6 :
SELECT A.GR,A.DT,A.FA,A.ERID,A.HID,A.OE,A.LOE,A.OBE,A.OBT,A.STE,A.CLD,A.GI,A.LT,A.RE,A.HTLT
FROM
(SELECT A.GR,A.DT,A.FA,A.ERID,A.HID,A.OE,A.LOE,A.OBE,A.OBT,A.STE,A.CLD,A.GI,A.LT,A.RE,A.HTLT
FROM
ADE.OBSN_E)  A
INNER JOIN
(SELECT DISTINCT GR,GI,OE,LOE,OBE,CLD
FROM ADE.OBSN_E
GROUP BY
GR,GI,OE,LOE,OBE,CLD
HAVING COUNT(*)>1)  B
ON
A.GR=B.GR AND
A.GI=B.GI AND
A.OE=B.OE AND
A.LOE=B.LOE AND
A.OBE=B.OBE AND
A.CLD=B.CLD
In the format mentioned by you :
SELECT A.* FROM
ADE.OBSN_E A
INNER JOIN
(
SELECT GR,GI,OE,LOE,OBE,CLD  
FROM ADE.OBSN_E  
GROUP BY GR,GI,OE,LOE,OBE,CLD 
HAVING COUNT(*)>1 
) B
ON
A.GR=B.GR AND
A.GI=B.GI AND
A.OE=B.OE AND
A.LOE=B.LOE AND
A.OBE=B.OBE AND
A.CLD=B.CLD

dnoeth 4628 posts Joined 11/04
12 Sep 2013

You said those columns are the Primary Key of the table, but there might be some NULLs in it?
This would result in returning less rows.
 
Dieter

Dieter

tddev1984 9 posts Joined 09/13
12 Sep 2013

Yes, that's right. I have verified the data in the table and could see that one or more columns (GR,GI,OE,LOE,OBE,CLD) are having NULL values.
Please note that these are not primary indexes/keys defined on the table but the natural/business keys from the source data.

tddev1984 9 posts Joined 09/13
13 Sep 2013

Hi,
Ignoring the fact that the tables has NULL values for natural key columns, could you let me know if the logic being used to capture the duplicate records is correct ?
Thanks Much!!!

Qaisar Kiani 337 posts Joined 11/05
13 Sep 2013

Why don't you run a test on some sample data????

Gowthami90 2 posts Joined 04/16
15 Apr 2016

Hi, 
I need teradata query for below scenario.
Table:
 NO     Name          YOJ       NCD     PY_Amt        RD_ID    PY_DT
  1       JOE             2010     CN      19010.00      213        01/20/2011
  2       JOE             2010     CN      19010.00      435        04/15/2012
  3       JOHN           2011     RT      34555.00      654        03/25/2013
  4       JOHN           2011     RT      23415.00      654        04/28/2014
  5      KANE           2012     ER     12211.00      1231       01/20/2013 
  5      KANE           2012     ER     12211.00      1231       02/20/2013 
  5      KIM             2010     TU     34521.00      1531       01/20/2013 
From the above table, we  need to pick the records which are having more than one record for the combination of Name,YOJ,NCD and differing with either of PY_Amt or RD_ID and From these we need to pick only earlier record based on PY_DT
Result table should be pick only the yellow marked records.
Could you please provide query for this
 
Thanks in advance,
Gowthami Krishnamoorthy
 

dnoeth 4628 posts Joined 11/04
17 Apr 2016

There are no yellow marked records, you must add the rows using the code editor, there you can highlight.
 
It's either based on COUNT OVER or maybe a simple ROW_NUMBER...

Dieter

Gowthami90 2 posts Joined 04/16
18 Apr 2016

Sorry for that.
Result table should be like:
 NO     Name          YOJ       NCD     PY_Amt        RD_ID    PY_DT
 2       JOE             2010     CN      19010.00      435        04/15/2012
  4       JOHN           2011     RT      23415.00      654        04/28/2014
As for 1&2 , 3&4 records are having duplicate when comparing Name,YOJ,NCD but differening either in PY_Amt ,RD_ID. And we need to pick latest record from this.  Hence expecting 2nd,4th rows in result

AtardecerR0j0 71 posts Joined 09/12
19 Apr 2016
select *
from YourTable
qualify rank() over( partition by Name, YOJ, NCD
                         order by PY_DT desc ) = 1

In the above code It's supposed there aren't dupplicate when comparing Name, YOJ, NCD and PY_DT

Be More!!

yuvaevergreen 93 posts Joined 07/09
19 Apr 2016

SEL A.* FROM 

(SEL A.*

FROM 

(SEL TBL.*  FROM TABLE TBL QUALIFY (COUNT(*) OVER(PARTITION BY NAME,YOJ,NCD ORDER BY PY_DT DESC) >1)) A

 QUALIFY COUNT(*) OVER(PARTITION BY NAME,YOJ,NCD,PY_AMT,RD_ID ORDER BY PY_DT DESC)=1

UNION

SEL A.*

FROM 

(SEL TBL.*  FROM TABLE TBL QUALIFY  (COUNT(*) OVER(PARTITION BY NAME,YOJ,NCD ORDER BY PY_DT DESC) >1)) A

QUALIFY  COUNT(*) OVER(PARTITION BY NAME,YOJ,NCD,PY_AMT,RD_ID ORDER BY PY_DT DESC) >1

AND 

(NAME,YOJ,NCD) IN

(SEL * FROM  (SEL A.NAME,A.YOJ,A.NCD FROM 

(SEL TBL.*  FROM TABLE TBL QUALIFY (COUNT(*) OVER(PARTITION BY NAME,YOJ,NCD ORDER BY PY_DT DESC) >1)

)A

QUALIFY 

COUNT(*) OVER(PARTITION BY NAME,YOJ,NCD,PY_AMT,RD_ID ORDER BY PY_DT DESC) =1

)A )

) A

QUALIFY 

ROW_NUMBER() OVER(PARTITION BY NAME,YOJ,NCD ORDER BY PY_DT DESC)=1

You must sign in to leave a comment.