atulgupta01 9 posts Joined 12/14
16 Jan 2015
TPT - source data enclosed with double quotes and delimited by pipe (|)

Hi,
 
I am new to Teradata. I am trying to load data from a CSV file into Teradata 13.0 on VMware using TPT, where the columns are enclosed in double quotes (") and delimited by pipes (|).
 
e.g.

"1001"|"Amit"|"FIN"|"10001"|"1981-10-10"
"1002"|"Manish"|"FIN"|"25000"|"1990-01-01"
"1003"|"Sid"|""|"2000"|"1986-04-22"
"1004"|"Macho"|"GM"|""|"1972-05-19"

 

I am using the script below:

DEFINE JOB LOAD_M1_TABLE_FROM_FILE
DESCRIPTION 'LOAD SAMPLE M1 TABLE FROM A FILE'
(
  DEFINE SCHEMA M1_SCHEMA
  DESCRIPTION 'SAMPLE M1 SCHEMA'
  (
   EMP_ID   VARCHAR(12),
   EMP_NAME VARCHAR(12),
   EMP_DEPT VARCHAR(5),
   SALARY   VARCHAR(12),
   DOB      VARCHAR(12)
  );

  DEFINE OPERATOR DDL_OPERATOR()
  DESCRIPTION 'TERADATA PARALLEL TRANSPORTER DDL OPERATOR'
  TYPE DDL
  ATTRIBUTES
  (
   VARCHAR PrivateLogName = 'tpt_script_ddloper_log',
   VARCHAR TdpId          = 'dbc',
   VARCHAR UserName       = 'dbadmin',
   VARCHAR UserPassword   = 'dbadmin',
   VARCHAR AccountId,
   VARCHAR ErrorList      = '3807'
  );

  DEFINE OPERATOR LOAD_OPERATOR()
  DESCRIPTION 'TERADATA PARALLEL TRANSPORTER LOAD OPERATOR'
  TYPE LOAD
  SCHEMA M1_SCHEMA
  ATTRIBUTES
  (
   VARCHAR PrivateLogName = 'tpt_script_loadoper_privatelog',
   INTEGER MaxSessions    = 32,
   INTEGER MinSessions    = 2,
   VARCHAR TargetTable    = 'M1',
   VARCHAR TdpId          = 'dbc',
   VARCHAR UserName       = 'dbadmin',
   VARCHAR UserPassword   = 'dbadmin',
   VARCHAR AccountId,
   VARCHAR ErrorTable1    = 'tpt_script_LOADOPER_ERRTABLE1',
   VARCHAR ErrorTable2    = 'tpt_script_LOADOPER_ERRTABLE2',
   VARCHAR LogTable       = 'tpt_script_LOADOPER_LOGTABLE'
  );

  DEFINE OPERATOR FILE_READER()
  DESCRIPTION 'TERADATA PARALLEL TRANSPORTER DATA CONNECTOR OPERATOR'
  TYPE DATACONNECTOR PRODUCER
  SCHEMA M1_SCHEMA
  ATTRIBUTES
  (
   VARCHAR PrivateLogName    = 'tpt_script_dataconnoper_reader_privatelog',
   VARCHAR FileName          = 'code*.txt',
   VARCHAR IndicatorMode     = 'N',
   VARCHAR OpenMode          = 'Read',
   VARCHAR Format            = 'Delimited',
   VARCHAR NullColumns       = 'Y',
   VARCHAR OpenQuoteMark     = '"',
   VARCHAR CloseQuoteMark    = '"',
   VARCHAR TextDelimiter     = '|',
   VARCHAR TrimChar          = '"',
   VARCHAR MultipleReaders   = 'Y',
   VARCHAR TrimColumns       = 'both',
   VARCHAR SkipRowsEveryFile = 'Y',
   INTEGER SkipRows          = 1
  );

  STEP setup_tables
  (
    APPLY
    ('DROP TABLE M1;'),
    ('DROP TABLE M1_LOADOPER_ERRTABLE1;'),
    ('DROP TABLE M1_LOADOPER_ERRTABLE2;'),
    ('DROP TABLE M1_LOADOPER_LOGTABLE;'),
    ('CREATE TABLE M1(EMP_ID VARCHAR(10), EMP_NAME VARCHAR(10), EMP_DEPT VARCHAR(3), SALARY VARCHAR(10), DOB VARCHAR(10));')
    TO OPERATOR (DDL_OPERATOR());
  );

  STEP load_data_from_file
  (
    APPLY
    ('INSERT INTO M1(:EMP_ID, :EMP_NAME, :EMP_DEPT, :SALARY, :DOB);')
    TO OPERATOR (LOAD_OPERATOR()[1])
    SELECT * FROM OPERATOR (FILE_READER()[1]);
  );
);

The data is getting loaded, but the double quotes are being included in the data.

In Teradata the data looks like this:

EMP_ID      EMP_NAME    EMP_DEPT  SALARY      DOB
----------  ----------  --------  ----------  ----------
"1004"      "Macho"     "GM       ""          "1972-05-1
"1002"      "Manish"    "FI       "25000"     "1990-01-0
"1007"      "Sidhu"     ""        "2000"      "1986-04-2
"1008"      "Mach"      "GM       ""          "1972-05-1
"1003"      "Sid"       ""        "2000"      "1986-04-2
"1006"      "Mani"      "FI       "25000"     "1990-01-0
Where the target column is shorter than the quoted value, the data is truncated before loading, e.g. columns EMP_DEPT and DOB.

Can you please help me out? I have been stuck on this for the last 3 days.

Regards

Atul

 

 

feinholz 1234 posts Joined 05/08
16 Jan 2015

You must enable the quoted-data feature by setting the QuotedData attribute of the DataConnector operator to 'Yes' or 'Optional'.
 
Please refer to the documentation for details.
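For example, in the FILE_READER operator's attribute list it would look something like this (a sketch; exact attribute support depends on your TPT version):

   VARCHAR Format        = 'Delimited',
   VARCHAR TextDelimiter = '|',
   VARCHAR QuotedData    = 'Optional'  /* or 'Yes' if every field is quoted */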
 

--SteveF

atulgupta01 9 posts Joined 12/14
18 Jan 2015

Hi Steve,
 
Thanks for the reply. However, I had already tried that option, by specifying:
 
VARCHAR QuotedData = 'Y',
 
and also I tried - 
 
VARCHAR QuotedData = 'Optional'
 
but neither works.
 
Please help.

Regards
Atul

feinholz 1234 posts Joined 05/08
19 Jan 2015

Sorry, I forgot to ask what version of TPT you are using.
The QuotedData feature was introduced in TPT 14.0.
If you are using a version prior to that, then the quotes will always be part of the data.

--SteveF

atulgupta01 9 posts Joined 12/14
19 Jan 2015

I am using Teradata 13.0 Express Edition on VMware.
 
Is there any other way to get the desired result?
 
Thanks
Atul

feinholz 1234 posts Joined 05/08
19 Jan 2015

The only way is to upgrade to a newer version of TPT.
(TPT 13.0 is not a supported release anymore.)
Or pre-process the data and strip the quotes.
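If pre-processing is an option, a one-line filter such as this would do it (a sketch; it assumes the quote character never appears inside a field value):

   sed 's/"//g' code1.txt > code1_stripped.txt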
 

--SteveF

atulgupta01 9 posts Joined 12/14
20 Jan 2015

Also, please let me know whether TPT 13.0 supports .gz (zipped) files.

Can you please provide a link comparing the features/differences between the 13.0 and 14.0 versions?
 
Thanks
Atul

feinholz 1234 posts Joined 05/08
20 Jan 2015

Compressed file support was implemented in TPT 14.10.
 

--SteveF

atulgupta01 9 posts Joined 12/14
29 Jan 2015

Hi Steve,
 
I installed 14.10 and was able to load a file with comma-separated fields, optionally enclosed in double quotes, through TPT. Sample data:

123, "Manish","FINANCE",10000,25
12, "James","HR",5000,30

 
 
Can you please help me with the points below?

1. Is there any method other than TPT that also supports comma-separated data, optionally enclosed in double quotes, like the example above?
2. In TPT, is it a limitation that I can load data only into VARCHAR columns in the table? What if I have number/date columns as well?
3. When I am using FastLoad, I can specify SET RECORD VARTEXT DELIMITER ';' QUOTE YES '"'; but the issue is that every column value must then be enclosed in ", whereas in my data the quotes are optional. Is there any way to get this done?
4. I am using the 14.10 Express Edition on VMware, where the number of instances is 1, there are 2 AMPs, and the maximum number of sessions is 2. Is there any way to increase the number of AMPs? If yes, what are the steps?
 
Thanks in Advance
Atul

atulgupta01 9 posts Joined 12/14
29 Jan 2015

5. Also, can you please let me know whether we can skip a (header) record with FastLoad? RECORD 2 is not working.
6. Can we use TDLOAD to load the above type of data (comma-separated, optionally enclosed in ") and also skip the first row (the header)?

feinholz 1234 posts Joined 05/08
29 Jan 2015

1. TPT is the load/unload product that we would like everyone to use.
2. The table can have any data types, as long as a conversion from VARCHAR to that data type is supported by Teradata. There is no limitation that the column data types of a table being loaded with delimited data must all be VARCHAR. (The schema in the script must be all VARCHAR, but that is not the same thing.)
3. We prefer that you not use FastLoad. Please use TPT instead. FastLoad is a capped utility and no new features are being added to it.
4. I am not familiar with the Express version of Teradata on VMware. If the installation only allows 2 AMPs, then that can probably not be increased. You should check the documentation to see if there is a method for changing the configuration.
5. FastLoad does not allow skipping a header row in a file with delimited records. TPT does. Please refer to #3.
6. Yes, it looks like TDLOAD will support the "Quoted Data" feature. You will need to specify a job variable file on the command line, and in that file provide values for the pre-defined job variables for the feature. For quoted data, the job variable name is DCPQuotedData. For skipping a header row (or a set of header rows), the job variable name is DCPSkipRows (provide the number of rows to skip). A sketch follows the next paragraph.
 
TDLOAD (also known as a feature of TPT called EasyLoader) uses operator templates under the covers. These templates provide the pre-defined job variables to be specified by the user. If you look in the install directory for TPT you will see a directory called "templates". In there you can take a look at the templates for all operators and look up the attributes that are supported by the operator and the corresponding job variable names to be specified when using the TDLOAD feature.
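To make #6 concrete, a minimal job variable file might look like this (a sketch: DCPQuotedData and DCPSkipRows are the template variable names described above, while the file name, connection options, and job name below are just examples; check the Easy Loader section of the TPT reference for your version's exact command syntax):

   /* jobvars.txt */
   DCPQuotedData = 'Optional',
   DCPSkipRows   = 1

   tdload -f emp.csv -t M1 -h dbc -u dbadmin -p dbadmin jobvars.txt quoted_load_job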
 
Hope this helps!
 

--SteveF

atulgupta01 9 posts Joined 12/14
03 Feb 2015

Hi Steve,
 
Thanks for answering all my queries, and sorry for bothering you again and again.

I have a few more queries:
1. In the TPT script, why do we require the schema to be all VARCHAR ("The schema in the script must be all VARCHAR, but that is not the same thing.") when we specify all the datatypes in the table creation? Why can we not define datatypes other than VARCHAR in the schema? Is it that TPT reads the data from the file as VARCHAR only, which is then converted into the required datatypes while the data is loaded?
2. Is columnar data storage in itself compression, or is compression something different?
3. Does compression only support a maximum of 256 distinct values, and is it required to specify those values? Why can't Teradata itself find and compress the distinct values in a column?
4. It would be great if you could give me a link or the steps/process to increase the number of AMPs. I would like to try it on my VMware version.
 
Regards
Atul
 

feinholz 1234 posts Joined 05/08
03 Feb 2015

With delimited data (also known as CSV data), the data is all character, and the fields vary in length. Thus, by definition, the data is VARCHAR.
 
If the user wants the utility to read in "integer", or "decimal", then that numeric data is provided in some type of binary format.
 
The schema and the table definition have nothing to do with each other (directly, as far as data processing is concerned). We will send VARCHAR data to the DBS and Teradata will convert to the data types of the columns of the target table. TPT does not perform any data conversions.
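For instance (a sketch reusing the M1 example from earlier in this thread; the column types are illustrative), the script schema describes the delimited file, while the target table can use native types and the DBS performs the conversion on insert:

   /* schema in the TPT script: all VARCHAR, matching the delimited file */
   DEFINE SCHEMA M1_SCHEMA
   (
    EMP_ID   VARCHAR(12),
    EMP_NAME VARCHAR(12),
    SALARY   VARCHAR(12),
    DOB      VARCHAR(12)
   );

   /* target table: native types; Teradata casts the VARCHAR data on insert */
   CREATE TABLE M1
   (
    EMP_ID   INTEGER,
    EMP_NAME VARCHAR(12),
    SALARY   DECIMAL(10,2),
    DOB      DATE
   );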
 
Compression is handled inside the database and not by any of the client load/unload products.
 
I am not familiar with any process to increase the number of AMPs on your VM. You would have to get that information from some other source (if it is even possible).
 

--SteveF

atulgupta01 9 posts Joined 12/14
19 Feb 2015

Hi Steve,
 
I have been trying different loading techniques to load around 25M rows into a table with around 110 columns.
 
I tried a TPT script (tbuild), mainly using the LOAD (FastLoad-style) operator. I have several queries below, if you can help me understand Teradata better:
 
1. Loading the same data into a table with all VARCHAR columns was slower than loading into a table with proper datatypes (integer, float, character, smallint, etc.). Is there any specific reason? Is this the correct behaviour, or am I doing something wrong? From my perspective, data conversion takes place for the table with proper datatypes, so that load should take more time, but in practice it does not.
 
2. Loading data from zipped files (around 50 zipped CSV files) took longer than loading the unzipped files. Is this the correct behaviour, and if so, why?
 
3. I tried loading the same data files, 25M records (50 files), using TPT MULTILOAD and TPUMP, but both were hanging even after 6-8 hours, whereas FASTLOAD was averaging around 2.0-2.5 hours. Is this normal? Can't we load huge files/data using MULTILOAD and TPUMP? What are the best numbers of files/rows/data for MULTILOAD and TPUMP?
 
4. Is there any way/option through which I can load all the data to only one AMP, so that the data is not distributed across multiple AMPs? Will it give better loading performance? Also, will it give better performance if this is a staging area and we then have to INSERT INTO ... SELECT to the target table (a 2-step process)?
 
5. Is it good to load data using one session, or is it better to load with multiple sessions? Is only one session allowed per AMP?
 
6. Why are secondary indexes not allowed on a table while loading data?
 
7. I was using the flushing parameter in the TPT command tbuild (-z 120). It actually made the load drastically slower. Is there any specific reason? Is it good to use or not? Does it flush the data read from the flat files to database blocks, or is it something else?
 
8. Why is loading to a columnar table (column-partitioned table) not allowed? Are there restrictions in the architecture of Teradata, given that Teradata is competing with many others in the industry?
 
9. While creating a table, why is columnar partitioning not allowed with a Primary Index (PI)?
 
10. According to Teradata, a PI is automatically created on the first column when we don't define a PI in the CREATE TABLE statement. But when I tried to create a (row-partitioned) table with the syntax below:

create table on_time_performance(
..
..
)
PARTITION BY ( CASE_N( Year1 = 2009,
                       Year1 = 2010,
                       Year1 = 2011,
                       Year1 = 2012,
                       Year1 = 2013,
                       Year1 = 2014,
                       NO CASE, UNKNOWN )
);

it gave me the error below:

*** Failure 3732 The facility of no primary index with row partitioning but no column partitioning has not been implemented yet.

Did it not create a PI automatically? And if it did, why was the above error thrown?

 

11. When we use delete from tduser.on_time_performance_d all; is it equivalent to TRUNCATE in Oracle? Will it also change the high-water mark?

12. Is each AMP like a separate node which works independently with its own resources?

 

 

Regards

Atul

atulgupta01 9 posts Joined 12/14
26 Feb 2015

Hi Steve,
Can you please help me out with the above?
 
Regards
Atul

feinholz 1234 posts Joined 05/08
27 Feb 2015

I will not be able to address all of the questions, but will try my best:
 
1. When you say you are loading a Teradata table defined with all VARCHAR fields, does that mean you defined your table that way because your input data is delimited and thus treated as all VARCHAR? Or was this just an experiment to test performance with different table schemas? Also, when you say the performance is slower, do you mean from the client side or the DBS side?
 
If you are asking about loading performance with delimited data, it is slower than reading in (binary) data with native data types because of the parsing involved in processing delimited records.
 
If you are asking about Teradata RDBMS performance, I cannot comment on that.
 
2. Loading data from zipped files will not be as performant as unzipped files because of the time it takes to use the 3rd party tools to process the zipped files (probably some "unzipping" involved under the covers).
 
3. I cannot comment on why the Update or Stream operators would hang. That is a separate issue by itself that would have to be researched and diagnosed. TPT has no problems loading any size data file using any of the operators. If you would like information about Stream operator tuning parameters, that is an entirely separate (and long) subject and not appropriate for this particular thread.
 
4. I am not sure why you would want all of the data from a single file to go to a single AMP (data skew), but it is possible if you load to a NoPI table and use only one session in the job to load the data (see the sketch after this list). Loading to a NoPI table is generally faster than loading to a PI table, but with only one session, I am not sure that the performance will be better than loading with multiple sessions (to a PI table).
 
5. The whole point of improving performance is parallelism. Loading with multiple sessions will always be better than loading with one session. When using the Load and Update operators, the maximum number of sessions is one per "available" AMP.
 
6. Secondary indexes are disallowed only when using the Load and Update operators. You can always use the Stream (or SQL Inserter) operator to load data (through normal SQL sessions) into tables with secondary indexes. It is all about performance.
 
7. The -z command line option is not for "flushing"; it is for checkpointing. If you would like to create checkpoint intervals so that, if the DBS restarts, you do not have to restart from the very beginning, then you use the -z option (see the sketch after this list). Checkpointing is an expensive action, though. The entire load process has to come to a stop while the data in the shared memory buffers is completely sent to the DBS, followed by a checkpoint set of actions with the DBS. How often you checkpoint depends on how much time you are willing to lose (to re-send data that was not checkpointed). It may also depend on the amount of data you are loading. Everyone's use of checkpointing is different. I know some customers who are loading transactional data and checkpoint every few seconds (they want the data available sooner), and there are some that might checkpoint every 15 or 30 minutes on very long-running jobs.
 
8. Loading to columnar tables is allowed with TPT operators that use normal SQL sessions (Stream, SQL Inserter).
 
9. I cannot answer this question. It is a DBS internal issue and not my area.
 
10. I cannot answer this question. It is a DBS issue and not my area.
 
11. I do not know the answer to this question.
 
12. No, each AMP is not like a separate node. But it does work independently and has its own resources.
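To make #4 and #7 concrete (sketches; the attribute values, script name, and job name are illustrative): a single-session load can be forced through the Load operator's session attributes,

   INTEGER MaxSessions = 1,
   INTEGER MinSessions = 1

and a checkpoint interval (here 300 seconds) is set on the tbuild command line:

   tbuild -f load_m1.tpt -z 300 m1_load_job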
 
 

--SteveF

13 Mar 2015

Can someone share a PPT covering the basics of Teradata? One with diagrams would be even more useful.
 
