neo_3072 6 posts Joined 03/08
05 Nov 2012
Use TPT to load a million flat files

Hi,
 
I am facing a unique challenge: I need to process a million flat files for ELT.
I am using TPT to leverage parallelism and optimize loading; however, the acquisition phase is taking longer than expected.
 
I am using a batch directory scan with 2 file reader instances. The process starts well, reading ~1,000 files in under 30 secs, but it slows down once it crosses ~3,000 files.
 
The average file size is 500 KB.
I need inputs on how this can be optimized further to reduce the overall time.
Thanks

feinholz 1234 posts Joined 05/08
06 Nov 2012

Have you tried using more than 2 instances of the file reader?
What version of TPT are you using?
 

--SteveF

neo_3072 6 posts Joined 03/08
06 Nov 2012

I tried using up to 4 instances of the file reader.
Version 13.10

feinholz 1234 posts Joined 05/08
07 Nov 2012

This is a bug that we had previously discovered and is being fixed in an upcoming efix/patch.
The version will be 13.10.00.12.
It will take about 4-6 weeks to appear on the Teradata At Your Service patch site.
 

--SteveF

neo_3072 6 posts Joined 03/08
07 Nov 2012

Thanks for the update.
However, I am curious to know: will the processing continue at the same pace (~1,000 files in under 30 secs) for a million files once this patch is installed?

feinholz 1234 posts Joined 05/08
08 Nov 2012

We have not had any customer try to read 1 million files.
We have customers who are processing between 10,000 and 50,000 files.
Are you using the wildcard syntax for the FileName attribute for the DC operator?

--SteveF

neo_3072 6 posts Joined 03/08
08 Nov 2012

Yes, I am using "*" for the FileName.

feinholz 1234 posts Joined 05/08
08 Nov 2012

Are you checkpointing?
If so, how often?
You have all 1 million files in a single directory?
What is your row size (trying to get an idea how many rows per file)?
Are you using the Load operator?
How long are you *anticipating* the acquisition phase taking?
Is this a one-time job, or something that will need to be run over and over again (e.g., daily, weekly, monthly)?
Again, we have not had anyone try to process this many files. Therefore, we cannot foresee the types of issues you might encounter.
Our file reader operator will attempt to store all of the file names and their sizes in order to perform load balancing, so a job of this size needs a lot of internal memory for that.
I suspect you would do better with more than 2 instances of the file reader.
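For reference, the checkpoint interval can be set when the job is submitted. A minimal sketch, assuming tbuild is on the PATH; the script file name, interval, and job name below are placeholders:

import subprocess

# Submit the TPT job with an explicit checkpoint interval (-z, in seconds).
# "load_job.txt" and "million_file_load" are placeholder names.
result = subprocess.run(
    ["tbuild", "-f", "load_job.txt", "-z", "600", "million_file_load"],
    capture_output=True,
    text=True,
)
print(result.stdout)
print(result.stderr)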
 

--SteveF

feinholz 1234 posts Joined 05/08
08 Nov 2012

I had a customer several years ago who had to load 25 8GB files. He found that the best performance came from using 12 file reader instances. YMMV (your mileage may vary). You will have to experiment.
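If you want to experiment systematically, a rough timing harness might help. This is only a sketch; the job script names are hypothetical, each assumed to differ only in the reader instance count:

import subprocess
import time

# Hypothetical job scripts that differ only in the file reader instance count.
scripts = ["load_2readers.txt", "load_4readers.txt",
           "load_8readers.txt", "load_12readers.txt"]

for script in scripts:
    start = time.time()
    subprocess.run(["tbuild", "-f", script, "instance_timing_test"], check=False)
    print(f"{script}: {time.time() - start:.0f} seconds")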
What is the name of your company?

--SteveF

neo_3072 6 posts Joined 03/08
08 Nov 2012

 
Checkpointing: My understanding is that checkpoints come into play while applying rows to the DB; I am not sure whether the file reader makes use of any checkpoint attribute. In any case, we are not specifying any checkpoint.
1 million files in a single directory: No, we can distribute them across multiple directories on the same volume (see the sketch after this list).
No. of rows per file: a wide range of values depending on the size of the file; 500 KB is just an average figure, roughly 100 rows per 500 KB.
Load operator: We tried both the Load and the Stream operator.
Anticipated acquisition time: if we extrapolate the current results, the whole processing may take around 40-50 hrs, which does not seem acceptable.
Frequency: monthly
We tried using up to 10 readers; however, throughput was still the same.
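For the directory split mentioned above, something along these lines could work; it is only a sketch, and the paths and bucket count are placeholders:

import os
import shutil

# Spread a large flat-file directory across N subdirectories.
SOURCE_DIR = "/data/inbound"        # placeholder
TARGET_DIR = "/data/inbound_split"  # placeholder
BUCKETS = 100                       # placeholder

for i, name in enumerate(os.listdir(SOURCE_DIR)):
    bucket = os.path.join(TARGET_DIR, f"bucket_{i % BUCKETS:03d}")
    os.makedirs(bucket, exist_ok=True)
    shutil.move(os.path.join(SOURCE_DIR, name), os.path.join(bucket, name))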

jinli 10 posts Joined 11/12
14 Nov 2012

Are those 1 million files in the same format, and do they need to be loaded into one table? If so, have you considered doing some pre-processing, such as combining them into an adequate number of bigger files and then loading those? 500 KB is rather small.
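As an illustration, assuming plain record-per-line files with no per-file header or footer (which may not hold here), a simple consolidation pass could look like this; the paths and target size are placeholders:

import glob
import os

SOURCE_GLOB = "/data/inbound/*.txt"   # placeholder
TARGET_DIR = "/data/batched"          # placeholder
TARGET_BYTES = 256 * 1024 * 1024      # ~256 MB per combined file

os.makedirs(TARGET_DIR, exist_ok=True)
batch_no, written, out = 0, 0, None

# Concatenate many small files into larger batch files of roughly TARGET_BYTES.
for path in glob.glob(SOURCE_GLOB):
    if out is None or written >= TARGET_BYTES:
        if out:
            out.close()
        batch_no += 1
        out = open(os.path.join(TARGET_DIR, f"batch_{batch_no:05d}.txt"), "wb")
        written = 0
    with open(path, "rb") as src:
        data = src.read()
        out.write(data)
        written += len(data)

if out:
    out.close()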

neo_3072 6 posts Joined 03/08
18 Nov 2012

These files have a definite format: a header, body, and footer. Each block has some data to capture. We tried pre-processing and consolidation using a Unix shell script; however, the parsing logic is pretty complex and has an adverse effect on CPU.
Data from these blocks needs to be loaded into separate tables.
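In case a different pre-processor is cheaper than the shell script, here is a rough Python sketch. It assumes each record's first character identifies its block ('H' header, 'D' body/detail, 'T' footer), which may not match the real layout, and the paths are placeholders:

import glob
import os

os.makedirs("/data/consolidated", exist_ok=True)  # placeholder path

# One consolidated output per block type / target table.
outputs = {
    "H": open("/data/consolidated/header.txt", "a"),
    "D": open("/data/consolidated/body.txt", "a"),
    "T": open("/data/consolidated/footer.txt", "a"),
}

for path in glob.glob("/data/inbound/*.txt"):  # placeholder path
    with open(path) as src:
        for line in src:
            out = outputs.get(line[:1])
            if out:
                out.write(line)

for out in outputs.values():
    out.close()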
