A high-availability system must have the ability to identify and correct errors, exceptions and failures in a timely and reliable manner to meet challenging service level objectives. The Teradata database and the utilities and components (used to both load and access data) provide capabilities to implement reliable error and exception handling functionality. These capabilities combined with a well designed high availability architecture allow a Teradata Active Enterprise Intelligence (AEI) system to meet the service level objectives required to support mission critical business processes.

This series of articles focuses on the error handling functionality and restart capabilities of the Teradata database and the Teradata data load utilities.

Examples of error and restart handling for the following utilities are included in this series:

  • Fastload
  • Multiload
  • Tpump
  • Teradata Parallel Transporter (TPT)
    • Load operator
    • Update operator
    • Stream operator

In most cases a load utility job for each of these utilities can be re-started after a database failure and reset, a load process error or failure or a load utility client platform failure after the underlying problem has been corrected.

Active Enterprise Intelligence Ecosystem

An Active Enterprise Intelligence Ecosystem consists of not only the Teradata database subsystem but all of the surrounding subsystems that require service from the database subsystem and provide services to the database subsystem. The following diagram illustrates the major subsystems in an AEI Ecosystem.

Active Enterprise Intelligence Ecosystem

Each subsystem must have error and failure handling capabilities. This series covers the capabilities and approaches to handle database, data integration and application integration errors and failures. This series will focus on handling errors, exceptions and failures in a Data Integration subsystem that is implemented with Teradata load utilitities.

Errors, Exceptions and Failures

A system can encounter multiple types of errors and failures. Error and failure handling processes must execute the following steps:

  1. Identify error or failure condition
  2. Classify error or failure
  3. Log error or failure
  4. Notify appropriate personnel
  5. Fix error or failure condition
  6. Return system to service
  7. Restart or re-execute operation that was executing when error or exception occurred.

The first step is to identify that an error or failure has occurred. In most cases a failure of the database will be detected by either a data load client or a data access client during a database operation. In this case the error must be classified and reported by the database client. In other cases the client may fail during a database operation. In each case the error must be classified.

Errors in the database or a database client can be classified as:

  1. System Error – System errors are an error or failure in the processing platform, storage or system software. For example, database amp failures, node failures, disk failures or storage cabinet failures would be system failures. In most cases the Teradata database will isolate the failed components and can be restarted. In many cases a database operation can be restarted where it was when the error occurred.
  2. Configuration Error – A configuration error is caused by a system configuration setting and can be corrected by changing or adjusting the configuration. For example, the database may report an out of space error when the perm space or spool space does not provide sufficient storage for a database operation. In many cases this type of error can be corrected and the database operation can be restarted.
  3. Programming Error – Programming errors represent errors in data load scripts or programs that are executing data access logic. In most cases the error is detected before a database operation has interacted with the database. In this case the programming error can be corrected and the operation can be re-run.
  4. Data Error – A data error or exception represents an error in the structure or value of data that is used to query or update a database. For example an error may be reported when a query to return a row based on a primary key can not find the associated row or an insert attempts to insert a duplicate row. In this case, the data values must be corrected and the database operation re-executed.
  5. Application Logic Error – An application logic error represents a violation of application logic during a database operation. In most cases the conditions that cause the error must be corrected and the operation must be re-executed. For example, a database interaction may attempt to increase a salary attribute by 25%. The application logic may restrict increases to 20%. This would cause an application logic error.

The ability to restart Teradata load utility operations that are interrupted by an error or exception is a key feature of these utilities. Care must be taken to identify the error properly to determine whether a restart is possible and to configure the utility scripts properly for restarts.

It is critical that errors and exceptions be identified and classified. This is necessary to determine whether the problem can be isolated and the system returned to service automatically using Teradata high availability features or if the error requires intervention. In the case of intervention the type of error will determine whether the intervention is by a DBA, programmer or Teradata support personnel.

Teradata load utilities have robust error handling and restart capabilities. This allows a database operation that is interrupted by a system error or by excessive data errors to be restarted and not have to be completely re-executed. A data integration operation may be interrupted by database errors, script errors, data errors or failures of a client process or platform. The Teradata load utilities provide the capability to restart a load operation in all of these cases.

Data integration operations are usually implemented as jobs that contain a series of steps. The steps may be defined in a single Teradata load utility script or as series of steps that execute load utilities or SQL scripts. In each case the job should be step restartable. That means that after an error the job should be able to be restarted at the point or step that was executing when the error occurred. In addition, each step should be able to be restarted or re-run. A restarted step will not re-execute the database interactions that occurred before the error. A step that is re-run will re-execute all database interactions in that step and continue with succeeding steps.

Jobs can be defined and executed using job scheduling and control systems (eg. Control-M, Tivoli, AutoSys), data integration systems (eg. Informatica, Oracle ODI, DataStage), or with scripts (eg. shell scripts, perl). In each case the job scheduling and control system should allow a job to be started at a step that was executing when an error occurred and it should allow a step to either be restarted or re-run.

Teradata utilities allow steps defined within the utility scripts to start at the step where an error occurred or to be rerun completely. They also allow a job interrupted by a database reset to continue automatically once the database is back in service.

The utilities allow checkpoints to be established to prevent the re-execution of database operations that have completed before an error caused an interruption.

Each of the load utilities allows parameters to be set that control:

  1. Tenacity – Tenacity define the length time an execution of a utility will attempt to establish a database connection
  2. Checkpoint interval – A checkpoint interval defines the frequency (either in time or records processed) when actions by the utility are marked as successful. This allows a restart to continue after the last successfully action
  3. Error Limit – The Error Limit parameter defines the maximum number of errors (usually data errors) that may occur before the load utility terminates due to errors.

All of the Teradata load utilities provide a return or completion code to the operation system or program that executed the utility. All job control definitions should check the return codes and take the appropriate action based on the return code. The Teradata utilities return the following completion codes:

  • 0 – Normal completion. All steps executed successfully.
  • 4 – Warning. A warning condition occurred; for example, a job deviated from normal or from the specified plan, but still completed successfully. A warning may indicate deviation from the plan; for example, the number of sessions specified were not actually used, or a part of the job did not run. Warning conditions do not terminate the job.
  • 8 – User error. A user error is usually a programming or configuration error in the load utility script. User error conditions terminate the job. In most cases the programming or script error can be corrected and the load utility job can be re-started.
  • 12 - Severe error. A fatal error terminated the job. A fatal error is any error other than a user error. Severe errors terminate a load utility job. In most case the error can be corrected and the load utility restarted at the point where the failure occurred.

In each case the specific database error must be found in the output log of the load utility to determine the next course of action (fix, re-start, re-run). A job control system can parse the output logs to determine and classify the database errors (beyond the scope of this series of articles). In all cases the underlying error must be identified, classified, fixed and then the job restarted at the appropriate step.

In the following series of articles we will describe how to use the features of Teradata load utilities to handle database and client errors and failures.