abiffle 4 posts Joined 01/12
19 Feb 2014
Backlog issue occurring when UNION ALLing multiple TPT DataConnector operators into Stream

Hi all,
I am encountering an unexpected situation when using multiple DataConnector operators UNION ALLed together, and would appreciate your input/suggestions.
I have a process that uses TPT DataConnector and TPT Stream to load files that are being continuously received.  The process has been working fine, but requires two instances of the DataConnector operator in order to prevent file reading from being a bottleneck.
I now have the requirement of loading the files in order received.  I can accomplish this through the VigilSortField property of DataConnector, but I cannot use multiple instances when using this property.
In order to get around this limitation, I was hoping to achieve the same parallelism by using two distinct DataConnector operators, each processing half the files and UNION ALLing into one Stream operator.  This approach seemed to fit my situation well, since two files with different file patterns are received every 5 seconds.
However, I am running into an issue because one of the two files is consistently larger.  The DataConnector operator processing the larger files is consistently falling behind, and is unable to ever process its backlog.  
Based on the feinholz quote below, I understand that the balancing between the two operators is based on file size rather than file count, so I believe it is expected that the operator processing the larger files would process fewer files per checkpoint.

"When using multiple instances to read from multiple files, we load balance the files across the instances according to the file sizes." how-this-works

However, because files are constantly being received in my scenario, this "larger-file" operator is never able to process its backlog.  I would guess that this is occurring because whenever either DataConnector finishes processing all files detected during its directory scan, it causes both operators to checkpoint.
For example:

  • Directory Scan:  DataConnector Small notices X files, DataConnector Large notices X files
  • Select Phase: DataConnector Small processes X files, DataConnector Large processes (X - R) files, leaving a remainder of R files unprocessed due to filesize-based balancing
  • Checkpoint (triggered by DataConnector Small finishing its directory scan list, even though DataConnector Large still has R files it could process)
  • --
  • Directory Scan:  DataConnector Small notices Y files, DataConnector Large notices Y + R files
  • etc -- R grows unbounded until VigilMaxFiles is exceeded and the process aborts
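The cycle above can be sketched as a small simulation. This is only a toy model of the behavior I am guessing at, not TPT code: the function name, its parameters, and the assumption that the "large" operator gets through a fixed number of files before each checkpoint are all hypothetical.

```python
def simulate_large_backlog(new_files_per_scan, large_done_per_checkpoint,
                           scans, vigil_max_files):
    """Toy model of the suspected behavior (not TPT itself).

    Each directory scan, the 'large' operator notices its new files plus
    whatever it left behind. It only gets through a fixed number of files
    before the 'small' operator finishes and a checkpoint is taken.
    Returns (backlog after each checkpoint, scan number at which the
    noticed file count exceeds vigil_max_files, or None if it never does).
    """
    backlog = 0
    history = []
    for scan in range(1, scans + 1):
        noticed = new_files_per_scan + backlog
        if noticed > vigil_max_files:
            # Modeled as the point where the job would abort.
            return history, scan
        processed = min(noticed, large_done_per_checkpoint)
        backlog = noticed - processed
        history.append(backlog)
    return history, None


if __name__ == "__main__":
    # Hypothetical numbers: 10 new files per scan, 8 processed per checkpoint.
    backlog_history, abort_scan = simulate_large_backlog(10, 8, 5, 50_000)
    print(backlog_history, abort_scan)
```

In this model the remainder R is `new_files_per_scan - large_done_per_checkpoint`, and the backlog grows by R on every scan for as long as R is positive, which matches what I am seeing.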

Can anyone suggest a way to resolve this issue?  Is there any setting that will change the balancing to file count rather than file size, or prevent the checkpoint from occurring until both DataConnector operators have processed their entire directory scan?
(I am using TPT 14.0, with VigilMaxFiles set to the maximum of 50,000 and VigilWaitTime set to 1.  No -l latency_interval is set on the tbuild command.)


feinholz 1234 posts Joined 05/08
20 Feb 2014

A few thoughts.
If you are using the Stream operator, I doubt the file reading will ever be the bottleneck.
The Stream operator will probably always be running slower than the file reading.
Next, if you are worried about order, the UNION ALL will not help you. The data from the 2 DC operators will be merged together into the same data streams that feed the Stream operator and order cannot be guaranteed.
I am also not quite sure what you mean by "The DataConnector operator processing the larger files is consistently falling behind". Logically speaking, processing large files will always take longer than processing small files. Thus, the instance that is reading the small files will always finish prior to the instance processing the larger files.
Another thing. If you are using UNION ALL, the 2 DC operators do not know about each other. They run independently of each other, and thus file assignment by size does not come into play. Only when multiple instances of a single DC operator are specified will the files be spread out between the instances according to size.
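As a rough illustration of what spreading files across instances by size can look like, here is a generic greedy sketch. It is not TPT's actual algorithm, which is not described in this thread; the function name, the file names, and the largest-first heuristic are all assumptions made for the example.

```python
import heapq


def balance_by_size(file_sizes, instances):
    """Greedy size-based assignment (illustrative only, not TPT's algorithm).

    file_sizes: dict mapping file name -> size in bytes.
    Assigns files largest-first, each to the instance that currently has
    the fewest total bytes. Returns one list of file names per instance.
    """
    # Heap entries are (total bytes assigned so far, instance index).
    heap = [(0, i) for i in range(instances)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(instances)]
    by_size_desc = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size_desc:
        total, idx = heapq.heappop(heap)
        assignment[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    return assignment


if __name__ == "__main__":
    # Hypothetical files: one large file and three small ones, two instances.
    print(balance_by_size({"a": 100, "b": 10, "c": 10, "d": 10}, 2))
```

With sizes like these, one instance ends up with the single large file and the other with all three small ones, so balancing by size can give very uneven file counts per instance.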


abiffle 4 posts Joined 01/12
21 Feb 2014

Thanks for the reply feinholz.  I am not going to pursue multiple reader operators for this process anymore, but I would like to reply to your individual thoughts too.
You are right that file reading is not the actual bottleneck.  In my situation, the overall performance is sufficient (read: amazing) when I increase the number of Stream instances.  But, I ran into deadlock issues due to rowhash locking, and had to enable Serialize.  Since this limited me to one Stream instance, I compensated with additional DataConnector instances, and did see an improvement in performance.  But since I agree that file reading is not the bottleneck, I assume this was due to some incidental benefit, such as the multiple DataConnectors increasing the number of IO buffers, and thereby better distributing the data sent to Stream.  If this is the case, perhaps changes to other settings would provide the same benefit without requiring the second operator.  But trial and error has not led me there yet.
Regarding UNION ALL not guaranteeing order, I understand.  This process is only concerned with file order, not row order.  The only desire is to load older files before newer files when there is a backlog.
It's good to know that the file size balancing does not apply in this case.  But to rephrase the part that you did not follow -- what concerns me is that it appears that when DataConnector A has finished processing all files in its current directory scan, it and DataConnector B both perform a checkpoint, even though DataConnector B still has files left that it could have processed.  I understand that B would naturally take longer since it is processing larger files, but I worry that this premature checkpoint is the cause of DataConnector B falling behind, especially since one DataConnector on its own is sufficient to handle both sets of files.  So I wondered if there was any setting (such as the -z option) that could change this behavior, so that DataConnector B would finish all the files in its active directory scan before checkpointing.
It seems I may have wandered off into a use case that isn't worth supporting, though, and that I'm throwing the wrong tools at the problem of increasing throughput when both Serialize and VigilSortField are required.

feinholz 1234 posts Joined 05/08
21 Feb 2014

I will look into the checkpoint issue, to see if we have a synchronization issue regarding the use of UNION ALL. I know that when using multiple instances of a single DC operator, the checkpoint does not take place until all instances have completed processing the files from their directory scan.

