johnsunnydew 43 posts Joined 09/14
16 Mar 2015
Non-duplicate row selection query

Hi there,
I have a table with 1 billion records, in which close to half the records are exact row duplicates. I need to select one record from each set of duplicates and load it into another table. The query I used is below. The DIRECT_ID column in the ORDER BY clause is a sequence-generated ID, so I was picking the most recently generated ID from each set of duplicate records. But this is taking too much time.
Suggestions appreciated.
 
INSERT INTO New_Consumer
SELECT * FROM CONSUMER_TABLE
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY PROCESS_DATA_MONTH, GEOGRAPHY_CODE, PRODUCT_NAME, VARIABLE_NAME,
    TIME_PERIOD, VARIABLE_QUANTITY, OUTLET_NAME, BRAND_UNBRAND, CAMPAIGN_ID,
    VENDOR, CHANNEL_NAME, ACTIVE_FLAG, CREATED_DATE, CREATED_BY, MODIFIED_DATE,
    MODIFIED_BY, BATCH_ID, BATCH_RUN_NUMBER, IMS_NUM, GEO_ID, TIME_ID,
    AR_CUSTOMER_ID, CUSTOMER_ID, AR_PRODUCT_ID, PRODUCT_ID, INDICATION,
    PROGRAM_NAME, PROGRAM_CODE, MEDIA_SHOW_TIME, PROGRAM_DURATION, MEDIA_TYPE,
    CAMPAIGN_NAME, PLACEMENT_NAME, METRIC_TYPE, VARIABLE_DESCRIPTION,
    CHANNEL_ID, MARKET_IDENTIFIER, PRODUCT_IDENTIFIER, ABBREVIATION
  ORDER BY DIRECT_ID DESC) = 1
Regards

John

SmarakDas 51 posts Joined 02/12
17 Mar 2015

Hello John,

 

You can also try the approach below and see whether the performance gain is positive:

 

INSERT INTO <Target Table>
SELECT MAX(DIRECT_ID), <All Columns Except DIRECT_ID>
FROM <Table Name>
GROUP BY <All Columns Except DIRECT_ID>

 

An aggregate operation should give better performance than the partitioning operation.
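Spelled out with a few of the column names from your query (the rest elided; this sketch assumes New_Consumer lists DIRECT_ID first, so adjust the select list to the target's column order):

INSERT INTO New_Consumer
SELECT MAX(DIRECT_ID),
       PROCESS_DATA_MONTH,
       GEOGRAPHY_CODE,
       PRODUCT_NAME
       -- ... remaining columns except DIRECT_ID ...
FROM CONSUMER_TABLE
GROUP BY PROCESS_DATA_MONTH,
         GEOGRAPHY_CODE,
         PRODUCT_NAME
         -- ... same column list ...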

 

Thanks,

Smarak

johnsunnydew 43 posts Joined 09/14
17 Mar 2015

Hi Smarak,
Thanks. Not much improvement. I ran the query for almost 15 minutes and then aborted it. :(
Regards
John

yuvaevergreen 93 posts Joined 07/09
17 Mar 2015

If you have access to the DDLs:
Create a temp table as a SET table ("Table 1").
Insert the rows from the source into "Table 1".
Insert the rows from "Source" MINUS "Table 1" into the target table, based on DIRECT_ID.
i.e. Source: 1,2,1,3
Table 1: 1,2,3
Target table = Source - Table 1 = 1
This would work, but I'm not sure about the time.
Thanks
Yuva.

ravimans 54 posts Joined 02/14
17 Mar 2015

Hi John,
Apart from DIRECT_ID, are there any other columns you could use to differentiate between two records, for example a date or timestamp column?
When you insert into the New_Consumer table, do you want to select all distinct records from CONSUMER_TABLE? Is this a one-time activity, or, if it's a daily activity, why do you need to load all records?

dnoeth 4628 posts Joined 11/04
17 Mar 2015

There's no way to avoid the GROUP BY/PARTITION BY over all columns.
Clean your data once, and next time simply don't insert duplicates; your load process should take care of this.
 
There should be a logical Primary Key on your table, but that's not a sequence like DIRECT_ID.
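For the go-forward loads, a sketch of one way to filter duplicates at insert time, assuming a staging table for new data (STG_CONSUMER is just a placeholder name):

INSERT INTO CONSUMER_TABLE
SELECT s.*
FROM STG_CONSUMER s
WHERE NOT EXISTS
  (SELECT 1
   FROM CONSUMER_TABLE c
   WHERE c.PROCESS_DATA_MONTH = s.PROCESS_DATA_MONTH
     AND c.GEOGRAPHY_CODE = s.GEOGRAPHY_CODE
     -- ... AND-compare the remaining columns except DIRECT_ID ...
  );

Note that plain equality predicates never match NULLs, so nullable columns would need explicit handling.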

Dieter

johnsunnydew 43 posts Joined 09/14
17 Mar 2015

Hi Ravimans,
We do not have any other column that differentiates the records apart from the DIRECT_ID column.
Yes, we want to select all the distinct records from the consumer table, and it will be a one-time activity.
Regards
John

johnsunnydew 43 posts Joined 09/14
17 Mar 2015

Hi Dnoeth,
Unfortunately there is no other key which differentiates the records, which I admit is poor design. Living with what we have, we are trying a one-time load of the distinct records into a new table from the table that has duplicates. I used the same query as pasted above, with PARTITION BY, and I ran an EXPLAIN too, but the explain does not show how long the process might take to complete. Any help or advice?
 
Regards
John

yuvaevergreen 93 posts Joined 07/09
18 Mar 2015

The logic I gave may not work, since DIRECT_ID is different on each duplicate... My bad. :(
 
Thanks
Yuva

johnsunnydew 43 posts Joined 09/14
18 Mar 2015

No worries, Yuva. Any other ways? I have tried using PARTITION BY and GROUP BY, but both are taking too long. Any help?
 
Regards
John

Fred 1096 posts Joined 08/04
18 Mar 2015

Is there a small number of columns that together would be suitable as a NUPI, i.e. well-distributed and ideally not too many duplicates? If so, then explicitly redistributing by those columns (e.g. creating a Volatile Table copy with that NUPI) prior to the aggregation would likely provide some limited benefit.
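For example (CUSTOMER_ID and PRODUCT_ID here are purely illustrative choices, so check their distribution first):

CREATE VOLATILE TABLE VT_CONSUMER AS
  (SELECT * FROM CONSUMER_TABLE)
WITH DATA
PRIMARY INDEX (CUSTOMER_ID, PRODUCT_ID)
ON COMMIT PRESERVE ROWS;

-- then run the GROUP BY / QUALIFY deduplication against VT_CONSUMER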

ravimans 54 posts Joined 02/14
18 Mar 2015

Can you try also including the DIRECT_ID column in the PARTITION BY, rerun, and see the performance?

johnsunnydew 43 posts Joined 09/14
19 Mar 2015

Hi All,
I have executed the query below to get the distinct records, but quite unluckily it fails with a spool space error. Even with GROUP BY it's the same.
Except for the DIRECT_ID column, which is sequence-generated, all the rest of the columns are the same. Any insight or help is much appreciated. Query below.
SELECT * FROM CONSUMER_TABLE
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY PROCESS_DATA_MONTH, GEOGRAPHY_CODE, PRODUCT_NAME, VARIABLE_NAME,
    TIME_PERIOD, VARIABLE_QUANTITY, OUTLET_NAME, BRAND_UNBRAND, CAMPAIGN_ID,
    VENDOR, CHANNEL_NAME, ACTIVE_FLAG, CREATED_DATE, CREATED_BY, MODIFIED_DATE,
    MODIFIED_BY, BATCH_ID, BATCH_RUN_NUMBER, IMS_NUM, GEO_ID, TIME_ID,
    AR_CUSTOMER_ID, CUSTOMER_ID, AR_PRODUCT_ID, PRODUCT_ID, INDICATION,
    PROGRAM_NAME, PROGRAM_CODE, MEDIA_SHOW_TIME, PROGRAM_DURATION, MEDIA_TYPE,
    CAMPAIGN_NAME, PLACEMENT_NAME, METRIC_TYPE, VARIABLE_DESCRIPTION,
    CHANNEL_ID, MARKET_IDENTIFIER, PRODUCT_IDENTIFIER, ABBREVIATION
  ORDER BY DIRECT_ID DESC) = 1

yuvaevergreen 93 posts Joined 07/09
20 Mar 2015

Is it possible to create one more column? If yes, then an MD5 hash of the columns excluding DIRECT_ID can be computed and stored, and that column can be used in the PARTITION BY. This will improve the performance, I hope. If possible, try it.
 
Thanks,
Yuva.

dnoeth 4628 posts Joined 11/04
22 Mar 2015

Yuva's approach utilizing a hash function like SHA (but don't use HASHROW) should work.
You don't have to add it as a new column; simply calculate it on the fly by concatenating all columns. The GROUP BY will use less spool than PARTITION BY:

qualify
   row_number() 
   over (partition by 
           hash_sha512(TRIM(PROCESS_DATA_MONTH) || '.' ||
                       TRIM(GEOGRAPHY_CODE) || '.' ||      
         ...
         order by DIRECT_ID) = 1

select * from CONSUMER_TABLE 
WHERE DIRECT_ID IN
  (select min(DIRECT_ID)
   from CONSUMER_TABLE
   GROUP BY hash_sha512(TRIM(PROCESS_DATA_MONTH) || '.' ||
                       TRIM(GEOGRAPHY_CODE) || '.' ||      
         ...)

There's an existing SHA-2 implementation, but it's not implemented for Unicode: http://forums.teradata.com/forum/extensibility/sha-2-udfs-for-teradata

Dieter

johnsunnydew 43 posts Joined 09/14
22 Mar 2015

Hi Dieter,
Thanks. I just executed it, and it fails with error 3706: "Expected something between '(' and the 'TRIM' keyword."
 

select * from CONSUMER_TABLE
WHERE DIRECT_ID IN
  (select min(DIRECT_ID)
   from CONSUMER_TABLE
   GROUP BY hash_sha512(TRIM(PROCESS_DATA_MONTH) || '.' ||
                        TRIM(GEOGRAPHY_CODE) || '.' ||
                        TRIM(PRODUCT_NAME) || '.' ||
                        TRIM(VARIABLE_NAME) || '.' ||
                        TRIM(TIME_PERIOD) || '.' ||
                        TRIM(VARIABLE_QUANTITY) || '.' ||
                        TRIM(OUTLET_NAME) || '.' ||
                        TRIM(BRANDED_UNBRANDED) || '.' ||
                        TRIM(CAMPAIGN_ID) || '.' ||
                        TRIM(VENDOR) || '.' ||
                        TRIM(CHANNEL_NAME) || '.' ||
                        TRIM(ACTIVE_FLAG) || '.' ||
                        TRIM(CREATED_DATE) || '.' ||
                        TRIM(IMS_NUMBER) || '.' ||
                        TRIM(GEOGRAPHY_ID) || '.' ||
                        TRIM(TIME_ID) || '.' ||
                        TRIM(AR_CUSTOMER_ID) || '.' ||
                        TRIM(CUSTOMER_ID) || '.' ||
                        TRIM(AR_PRODUCT_ID) || '.' ||
                        TRIM(PRODUCT_ID) || '.' ||
                        TRIM(INDICATION) || '.' ||
                        TRIM(PROGRAM_NAME) || '.' ||
                        TRIM(PROGRAM_CODE) || '.' ||
                        TRIM(MEDIA_SHOW_TIME) || '.' ||
                        TRIM(PROGRAM_DURATION) || '.' ||
                        TRIM(MEDIA_TYPE) || '.' ||
                        TRIM(CAMPAIGN_NAME) || '.' ||
                        TRIM(PLACEMENT_NAME) || '.' ||
                        TRIM(VARIABLE_DESCRIPTION) || '.' ||
                        TRIM(CHANNEL_ID) || '.' ||
                        TRIM(MARKET_IDENTIFIER) || '.' ||
                        TRIM(PRODUCT_IDENTIFIER) || '.' ||
                        TRIM(ABBREVIATION)))

dnoeth 4628 posts Joined 11/04
23 Mar 2015

You need to install a hash function; hash_sha512 doesn't exist on your system.

Dieter

lecslecin 4 posts Joined 08/14
31 Mar 2015

If the DIRECT_ID value doesn't matter in the target:
Step 1: Create a SET table New_Consumer, excluding the DIRECT_ID column.
Step 2: Insert from Consumer_Table into New_Consumer; the duplicates will be discarded automatically.
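A minimal sketch of those two steps (column list elided; note that the SET table's duplicate-row check compares each incoming row against same-hash rows, so the insert itself can still be slow with this much duplication):

CREATE SET TABLE New_Consumer AS
  (SELECT PROCESS_DATA_MONTH, GEOGRAPHY_CODE /* ... all columns except DIRECT_ID ... */
   FROM CONSUMER_TABLE)
WITH NO DATA;

INSERT INTO New_Consumer
SELECT PROCESS_DATA_MONTH, GEOGRAPHY_CODE /* ... same column list ... */
FROM CONSUMER_TABLE;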
 
