All Forums UDA
Greyghost 51 posts Joined 07/09
23 Jul 2009
Difficulty Inserting records into Large table

Hello... I am trying to insert records (72 million+) from one table into a larger "master" table. The master table will be the final table after everything is complete; it currently has 330 million rows. The insert is a simple BTEQ script:

Insert into "master table" Select * from "subset table"

I have broken the subset table into smaller segments (4-10 million rows each), thinking that would help ease the loading times. Even these smaller segments take hours, and in some cases days, to load.

The master table is defined as a SET table, and has a Primary Index and is partitioned by year. I have collected statistics on the key and the partitions, but this does not seem to have helped with the load times.

I was thinking about the SET vs. MULTISET definition of the table. Would this help at all?

Every year I will receive another 100 million records that need to be loaded into this table, so improving the processing is important.

Any suggestions you have would be greatly appreciated.

Thanks!
Paul
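For reference, a minimal sketch of the load script (logon details and table names are placeholders, not my real ones):

```sql
.LOGON tdpid/loaduser,password;

-- Load one of the subset segments into the master table
INSERT INTO master_table
SELECT * FROM subset_table_seg1;

.LOGOFF;
.QUIT;
```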

Jimm 298 posts Joined 09/07
23 Jul 2009

If the table has a NUPI with lots of records sharing the same PI value (e.g. an account number for transactions), then MULTISET will definitely help. It bypasses the duplicate row check, which is almost certainly what is causing the long load times. If you do go to a MULTISET table, though, you become responsible for ensuring there are no duplicate records!

You should also take a look at your partitioning scheme. I presume that all (or the vast majority) of the records you load are for the most recent period. You do not say whether you load the 100M records once per year, or as roughly 9M per month.

If you load 100M in January, 90M of these will go into last year's partition, adding to the 9-10M you loaded last January; 10M will go into the new partition. The 10M will go in much quicker because they are going into an empty partition.

Try changing to monthly partitions instead of annual. This will limit you to 65K months, or about 5000 years, so probably not a problem.
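If the partitioning column were a DATE, a monthly RANGE_N scheme would look roughly like this (table and column names are made up for illustration):

```sql
CREATE MULTISET TABLE txn_master
( acct_id   INTEGER NOT NULL
, txn_date  DATE    NOT NULL
, amt       DECIMAL(12,2)
)
PRIMARY INDEX ( acct_id )
PARTITION BY RANGE_N(
    txn_date BETWEEN DATE '1980-01-01' AND DATE '2100-12-31'
               EACH INTERVAL '1' MONTH
  , UNKNOWN );
```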

dnoeth 4628 posts Joined 11/04
23 Jul 2009

Hi Paul,

Do a HELP STATS on the PI columns of that table and look for the Mode Frequency to check for duplicates. Could you post that plus the PI and partitioning?

Dieter

Dieter

Greyghost 51 posts Joined 07/09
23 Jul 2009

Hi Dieter,

Here is what I get when I do a Help Statistics on the main table:

Date      Time      Unique Values  Column Names
09/07/23  01:50:20  10,468,831     Irmf_Pyee_Id
09/07/23  01:50:24  6              PARTITION

Is this the info you were asking for?

Jim, the table is partitioned by YEAR. This is tax data I am dealing with, so I will get a new file once a year to load into this table. Unfortunately, I do not have a month that I could use for partitioning, only Tax Year.

Here are the key fields as defined in the DDL now:

PRIMARY INDEX XPKINFORMATION_RETURN ( Irmf_Pyee_Id )
PARTITION BY RANGE_N(CAST((Tax_Yr_Dte) AS INTEGER) BETWEEN 1980 AND 2100 EACH 1, UNKNOWN)
INDEX Pyertest1 ( Pyer_Adrs_Id )
INDEX Pyeetest1 ( Pyee_Adrs_Id )
INDEX PYerSec2 ( Irmf_Pyer_Id );

Should I collect statistics on all of these index fields, in addition to the PI?

Thanks for all the feedback!
Paul
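For reference, the statistics I have collected so far (the key and the partitions) were gathered roughly like this (sketch; the table name is a placeholder):

```sql
COLLECT STATISTICS ON information_return COLUMN Irmf_Pyee_Id;
COLLECT STATISTICS ON information_return COLUMN PARTITION;
```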

dnoeth 4628 posts Joined 11/04
23 Jul 2009

Hi Paul,

That is approx. 33 rows per value on average, which should be OK (especially if they are spread across different years). But the maximum might be totally different, so please do a HELP STATS for just the PI column:

HELP STATS tablename COLUMN Irmf_Pyee_Id;

If QueryLog is enabled on your site, you might also check it for the CPU usage of those INSERT/SELECT queries.

Dieter
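A sketch of such a QueryLog check (the view name and columns vary by release; DBC.QryLog and the literal table name here are assumptions):

```sql
SELECT StartTime
     , AmpCPUTime
     , TotalIOCount
     , QueryText
FROM DBC.QryLog
WHERE QueryText LIKE 'INSERT INTO master_table%'
ORDER BY StartTime DESC;
```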

Dieter

Greyghost 51 posts Joined 07/09
23 Jul 2009

Hi Dieter,

I am going to ask a dumb question. When you say Help Statistics on the PI, what do you mean? In my post above I showed what I get when I do a Help Statistics on the table, and the field listed is the PI. I'm not sure what to do to get the statistics on the PI itself. Or are you asking about the skewness? That is 12.7.

Thanks!
Paul

dnoeth 4628 posts Joined 11/04
23 Jul 2009

Hi Paul,

I edited my previous reply; I don't know why it didn't show that HELP STATS statement, I just tried to use a bold font...

That skew sounds bad; it's probably causing the bad performance. Changing to MULTISET will probably help, but let's see the output of that HELP STATS.

Dieter

Dieter

Greyghost 51 posts Joined 07/09
23 Jul 2009

Hi Dieter,

When I run a Help Statistics on the table I am inserting into, here is what I get:

Date      Time      Unique Values  Column Names
09/07/23  13:02:26  10,473,786     Irmf_Pyee_Id
09/07/23  13:02:32  6              PARTITION

A 4.5 million row insert just completed, and it took 10 hrs! Something can't be right about how this table is set up.

Thanks!
Paul

dnoeth 4628 posts Joined 11/04
23 Jul 2009

Hi Paul,

Do a HELP STATS for the PI column

HELP STATS tablename COLUMN Irmf_Pyee_Id;

to get the details for the PI. 4.5M rows in 10 hours is ridiculous; I can do that in a virtual machine on my laptop.

You could also do a:

SELECT Irmf_Pyee_Id, Tax_Yr_Dte, COUNT(*)
FROM tab
GROUP BY 1, 2
QUALIFY RANK() OVER (ORDER BY COUNT(*) DESC) <= 1;

to check for duplicates.

Dieter

Dieter

Greyghost 51 posts Joined 07/09
23 Jul 2009

Hi Dieter,

I agree... 10 hrs for 4.5 million is just unacceptable! That is why I am asking the forum for help.

I ran the HELP STATS statement you provided (thanks); it produced a pretty long output record. I am not sure what all you need to see, but I took a guess. Here it is:

Number of Rows..........370692047
Number of Nulls.........0
Number of Intervals.....100
Number of Uniques.......10473786
Numeric.................N
Min Value...............0
Mode Value..............300027069
Mode Frequency..........5698859
Max Value...............300027069
Mode Value..............141688470
Mode Frequency..........4787856

I hope this helps.

Thanks!
Paul

dnoeth 4628 posts Joined 11/04
23 Jul 2009

Hi Paul,

That's exactly the info I needed. Your data is totally skewed: up to 5698859 rows with the same value means a LOT of duplicate row checks.

If you can't change the PI, you should definitely change the table to MULTISET; then those INSERT/SELECTs will be fast.

Dieter
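Since SET/MULTISET cannot be switched with a simple ALTER TABLE (at least not on the releases I know), the usual route is to recreate and copy, roughly like this (names are placeholders; verify that secondary indexes and partitioning carry over to the copy):

```sql
-- Recreate the table as MULTISET, copy the rows, then swap names
CREATE MULTISET TABLE master_table_ms AS master_table WITH NO DATA;

INSERT INTO master_table_ms
SELECT * FROM master_table;

RENAME TABLE master_table TO master_table_old;
RENAME TABLE master_table_ms TO master_table;
```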

Dieter

Greyghost 51 posts Joined 07/09
29 Jul 2009

Hey Dieter,

Just to give you an update... changing the table to MULTISET worked! Huge performance improvement: I was able to load 72 million rows in 10 minutes!

Many thanks for helping me work through this issue.

Thanks!
Paul

Raja_KT 1246 posts Joined 07/09
30 Jul 2009

Hi,

Can you please share your master table DDL structure, at least the definition part? You can mask your field definitions.

Thanks and regards,
Raja

Raja K Thaw
My wiki: http://en.wikipedia.org/wiki/User:Kt_raj1
Street Children suffer not by their fault. We can help them if we want.
