marcmc 112 posts Joined 12/05
02 Jul 2007
Fast/Multi Load & Control Characters

Our source data is loaded into our warehouse staging area via a combination of the FastLoad and MultiLoad utilities. From time to time some control characters can slip into the source systems' data. For example, an email address such as Paul.O¢Hara@gmail.com appeared last week, where the cent character was meant to be an apostrophe. We use the caret (^) symbol as the delimiter for the fastload files, so that is what delimits the columns in the .txt file created for fastload. Everyone has to deal with control characters, and we are wondering what approaches some of you have employed.
Marc.

joedsilva 505 posts Joined 07/05
02 Jul 2007

Use fixed width; it's also much faster to load. The issue of space is normally circumvented in unix by using compression utilities (i.e. ask the source system to send zipped files) and loading them via named FIFOs.

marcmc 112 posts Joined 12/05
02 Jul 2007

Excuse my ignorance, but what do you mean by 'fixed width' and 'named FIFO'? Can you give examples?
Marc

joedsilva 505 posts Joined 07/05
02 Jul 2007

Fixed-width format is when you specify the maximum length of each field and pad the rest with zeros/spaces. For example, if this is a variable text format with pipe delimiters:

123|shannera@grare.com|NY

and the max allowed size of the first field (assuming integer) is 5, the second is 30, and the last field is 2, then your data record would look like this (there are no delimiters; all fields start and end at fixed positions):

00123shannera@grare.com<12 spaces>NY

In your load scripts you can then specify the record format to be TEXT. You will also have to change your input field specifications in the script from VARCHAR(n) to CHAR(n), and use NULLIF checks to translate spaces to null values, etc. Look up the manual for more details.

Named FIFOs are a feature in unix where one process can write to a FIFO and another process can read from it (similar to using pipes in shell commands). For example, if I had a zipped data file, I would do something like this in my shell script:

#!/usr/bin/ksh
# create a fifo for use; only this userid will have read/write access to it
mkfifo -m 600 myfifo
# unzip the contents of the data file and write the output to the fifo, in the background
gunzip -c mydata.gz > myfifo &
# store the pid of the background process
unzippid=$!
# the usual fastload stuff ...
fastload <<END
   <define the record format as TEXT, do other fastload gizmos,
    and point fastload to read from the fifo>
   FILE=myfifo;
END
# store the return code of fastload
rc=$?
# if fastload failed, kill the background process
if [ $rc -ne 0 ]
then
    kill $unzippid
fi
rm -f myfifo
# end of script
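For instance, the fixed-width field definitions for the three-field record above might look something like this (a minimal sketch only; the field names are made up, and you should check the FastLoad manual for the exact NULLIF form):

SET RECORD TEXT;
DEFINE
    cust_id (CHAR(5), NULLIF = '     '),  /* all-spaces field => null */
    email   (CHAR(30)),
    state   (CHAR(2))
FILE = myfifo;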

marcmc 112 posts Joined 12/05
04 Jul 2007

Thanks Joe. One of the better ideas I've been presented with is below; any thoughts are very much appreciated. The problem relates to what is a control character and what is not. The cent character is not a control character, nor is it a standard ASCII character; it's a special character, which differentiates it again from a non-printing character. What could be used is a better filter than just stripping at the source system or Linux end: a translation table for problematic characters, i.e., ^ in the text is translated to ', the cent character is translated to something else, and so on. The issue is where that should happen, i.e., all character conversion and stripping should happen on the source system, then the removal of all control characters on Linux, with a secondary filter to check and translate any remaining characters that might cause issues. At this point in development we feel fixed width is not an option, and we are still exploring the named FIFO option.
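The Linux-side secondary filter could be sketched along these lines (assumptions: the files arrive encoded as Latin-1, where the cent sign is byte 0xA2, and GNU tr/sed are available; file names are made up):

# 1) delete ASCII control characters, keeping only tab and line feed
#    (this also strips stray carriage returns)
tr -d '\000-\010\013-\037\177' < raw.txt > stripped.txt
# 2) then apply the translation table, e.g. cent sign (0xA2 in Latin-1) -> apostrophe
#    (the \xA2 hex escape is a GNU sed feature)
sed "s/\xA2/'/g" stripped.txt > clean.txt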

joedsilva 505 posts Joined 07/05
06 Jul 2007

Yes, it would be more of a data quality issue that needs to be fixed at the source system. I am personally against any translation done at the receiving system, as it might produce the wrong results at times; you shouldn't end up translating something that shouldn't have got translated (and which might propagate downstream, further complicating things). Maybe this is something the BAs and DAs should get together and decide upon. (Wash the linen only on the upstream! :-) )

j355ga 100 posts Joined 12/05
06 Jul 2007

In the ideal world there would never be delimiters or control (unprintable) characters embedded in your data. Since we live in the real world, you have to deal with this issue. As already stated, you can try to correct the source system, but often this is not possible either. Here are some solutions I've seen used:

1. Pick a delimiter that can "never" appear in the data. For example, ASCII 127 (0x7F) is the delete character. The delete character can be typed, but it never appears in the file you are working on. This works quite well but is not intuitive when reviewing the data files.
2. If the source system is Oracle, you can cleanse the offending columns by writing a cleanse function to change the unprintable characters to spaces.
3. Use perl or awk to change the unprintable characters in the source file, e.g.:
perl -ne 'chomp; s/[[:cntrl:]]//g; print "$_\n"' file.txt
(the chomp/print keeps the line feeds themselves from being stripped)
4. Use perl or awk to count the number of delimiters per row and reject the row or file if the count is incorrect, meaning that there were delimiters embedded in the data (see the awk sketch after this list).
5. Use double quotes around each column in the source file; tools like Informatica let you specify delimited and quoted input files.

There are probably more solutions but you get the idea....
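A minimal sketch of item 4 in awk, assuming the caret-delimited file has three columns (so exactly two carets) per row; the file names here are made up:

awk '
  { n = gsub(/\^/, "^") }                  # count the caret delimiters in the row
  n != 2 { print > "rejects.txt"; next }   # wrong count => embedded delimiter, reject
  { print > "clean.txt" }                  # row is structurally OK
' file.txt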

Jeff

marcmc 112 posts Joined 12/05
26 Jul 2007

We are researching the idea of cleaning at the source.The problem is at the moment that the main 'block looking' cntrl characters(they seem to be carriage returns and line feeds) we are receiving most, don't all have the same hex value. So we are trying to build up a library of ebcdic hex values that we can scan our source files for before ETL.Also, has anyone used/researched Teradata Data Mirroring products for anything in this context?Finally, if we were to do something as radical as use another products Utility(say DTS in SQLServer which is currently working for us) to get the files cleanly across could we dynamically call something like the OleLoad utility for an array of files from a Unix/Linux script?
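One way to start building that library: once a file has been converted to ASCII during transfer (e.g. via dd conv=ascii or iconv), a perl one-liner can tally which control-character code points actually occur. This is only a sketch under that assumption, and the file name is made up:

perl -ne 'while (/([\x00-\x08\x0B-\x1F\x7F])/g) { $hits{sprintf("%02X", ord($1))}++ }
          END { printf("0x%s: %d\n", $_, $hits{$_}) for sort keys %hits }' source.txt

It skips tab (0x09) and line feed (0x0A) but counts carriage returns and everything else, so the output is a frequency list of the hex values you would add to the scan library.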

Kevin Liu 3 posts Joined 06/05
18 Sep 2007

You can FastLoad from a DTS package by using the DTSPackageDSO OLE DB provider and the OLE DB access module. Here is a small experiment that you can use to verify that this works:

* On a system having Microsoft SQL Server 2000 installed, click "Start", then "Programs", then "Microsoft SQL Server", then "Import and Export Data". Follow the wizard steps to create a "Microsoft OLE DB Provider for SQL Server" to "Microsoft OLE DB Provider for SQL Server" package. On the "Specify Table Copy or Query" step, choose "Copy table(s) and view(s) from the source database". On the "Save, Schedule, and replicate package" step, uncheck all of the check boxes in the "When" group box, check the "Save DTS Package" check box, and select the "Structured Storage File" radio button in the "Save" group box.
* Start the "SQL Server Enterprise Manager" and find the "Data Transformation Services" icon. Right-click it and choose "Open package" to open the package that you created with the "Import and Export Data" wizard. Right-click the "transform data task" icon (it is an arrow) and click "Workflow Properties". On the "Options" tab, in the "Execution" group box, set the "DSO rowset provider" check box. Also, copy down the entire "Name" of the transform data task. It is important to get the entire name; it may not all be displayed, and you might have to click on it and then move the cursor all the way right and left over it to get the whole name.
* Save the package.
* Start OleLoad and, in the "Select a source" group box, select the "Microsoft OLE DB Provider for DTS Packages (DTSPackageDSO)" entry. On the "All" tab of the "Data Link Properties" dialog box, set the value of the Provider String property to /Ffilename (where filename is the full pathname of your DTS package).
* In the "Enter the name of the desired table" edit box, enter the name of the transform data task.

The rest is a normal FastLoad via OleLoad.
