All Forums Hadoop
HanifB 1 post Joined 07/13
24 Nov 2015
TDCH Import command - Export Data from Teradata to HADOOP

Please can you help and advise on the following:-
We are in the process of exporting data from Teradata to Hadoop using TDCH (Teradata Connector for Hadoop). We have managed to successfully import full tables from Teradata into Hadoop; however, where the Teradata tables are > 200GB we want to import only the daily deltas.
We have changed our script to use SOURCEQUERY in place of SOURCETABLE and supplied the SQL with a WHERE clause that selects only a subset of the data based on the processing date. We have also specified the import method explicitly; however, this is being overridden when the SOURCEQUERY parameter is used.
The method that TDCH falls back to causes a staging table to be created in the database area which is the full size of the existing table. This is causing us issues, since we do not have the spare space to replicate the table in the database area, and our job therefore abends with "2644 - No more Room in Database".
Please can anyone explain why our chosen method cannot be used together with SOURCEQUERY?
This is a sample script just to show the parameters we have used to invoke the IMPORT process using TDCH.
hadoop jar /usr/lib/sqoop/lib/teradata-connector-1.4.1.jar  terajdbc4.jar \
com.teradata.hadoop.tool.TeradataImportTool \
-D mapreduce.job.queuename=insights \
-url jdbc:teradata://tdprod/database=insight \
-username xxxxxxxx \
-password xxxxxxxxx  \
-classname com.teradata.jdbc.TeraDriver \
-fileformat textfile \
-splitbycolumn address_id \
-jobtype hdfs \
-method \
-targetpaths hdfs://dox/user/user1/td_prd_addresses/mail_drop_date='2015-11-20'/ \
-nummappers 1 \
-sourcequery "select col1, col2, col3, col4 from EDWPRODT.ADDRESSES where mail_drop_date = '2015-11-20'"

td_admirer 36 posts Joined 07/12
14 Dec 2015

When importing data into Hadoop from Teradata, if you use the "sourcetable" option in your TDCH job, you have the ability to choose any of the available import methods. The moment you use the "sourcequery" option, TDCH creates a temp table in the default DB and uses the split.by.partition method, even though you might have specified a different import method. This is done for performance reasons.
In split.by.partition, each mapper starts a session for a subset of partitions from the partitioned source staging table that is created, with a partition key the same as the distribution key. In this method, the "split by" option is moot and cannot be used. You might have also noticed that it uses a Random function. For data export from queries, this is the fastest method of data transfer.
With that said, if you do not want the job to create a staging table, you would have to create a BTEQ job which loads the result of the SQL into some DB which has enough space, and then use the "sourcetable" option with whichever import method you prefer.
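To illustrate that workaround, here is a rough sketch (STAGE_DB, ADDRESSES_DELTA, and the logon details are hypothetical placeholders; the columns and filter are taken from the original post):

```shell
# Sketch: pre-stage the delta rows into a database that has enough PERM
# space, then point TDCH at that table with -sourcetable so any import
# method can be chosen (no sourcequery, so no forced staging table).
bteq <<'EOF'
.LOGON tdprod/xxxxxxxx,xxxxxxxxx

/* Materialise only the daily delta */
CREATE TABLE STAGE_DB.ADDRESSES_DELTA AS
( SELECT col1, col2, col3, col4
  FROM   EDWPRODT.ADDRESSES
  WHERE  mail_drop_date = DATE '2015-11-20'
) WITH DATA;

.LOGOFF
EOF

hadoop jar /usr/lib/sqoop/lib/teradata-connector-1.4.1.jar \
  com.teradata.hadoop.tool.TeradataImportTool \
  -url jdbc:teradata://tdprod/database=STAGE_DB \
  -username xxxxxxxx \
  -password xxxxxxxxx \
  -jobtype hdfs \
  -fileformat textfile \
  -sourcetable ADDRESSES_DELTA \
  -targetpaths hdfs://dox/user/user1/td_prd_addresses/mail_drop_date=2015-11-20/
```

The staging table here is only the size of one day's delta, so the "2644 - No more Room in Database" problem should not arise; remember to drop it once the import completes.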
Hope this helps.

vaishakachar 1 post Joined 03/16
08 Mar 2016

When you use the "sourcequery" option and the number of mappers is > 1, by default TDCH will create a staging table and split the work across the mappers.
You can force the number of mappers to 1 (--num-mappers 1). This would force it to opt for a path that does not create a temporary table, and your problem should be solved. The splitter would check min(--split-by columnA) and max(--split-by columnA) and then distribute the ranges to the mappers, which here would be just one.
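For reference, a single-mapper variant of the command from the original post might look like the sketch below (credentials and paths are the placeholders from the question; note there is no -method or -splitbycolumn option):

```shell
hadoop jar /usr/lib/sqoop/lib/teradata-connector-1.4.1.jar \
  com.teradata.hadoop.tool.TeradataImportTool \
  -D mapreduce.job.queuename=insights \
  -url jdbc:teradata://tdprod/database=insight \
  -username xxxxxxxx \
  -password xxxxxxxxx \
  -jobtype hdfs \
  -fileformat textfile \
  -nummappers 1 \
  -sourcequery "select col1, col2, col3, col4 from EDWPRODT.ADDRESSES where mail_drop_date = '2015-11-20'" \
  -targetpaths hdfs://dox/user/user1/td_prd_addresses/mail_drop_date=2015-11-20/
```

A single mapper does mean the delta is transferred over one session, but for a daily delta that is usually an acceptable trade-off against the space cost of a full-size staging table.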
Vaishak S Achar
