In database multiple variable linear regression using the CM_Solve Table Operator
Attachment  Size 

cm_solve.zip  12.15 KB 
cm_solve_algo.zip  224.42 KB 
In a prior article [1] we described how to use the Teradata 14.10 CalcMatrix operator and R to perform a multiple variable linear regression analysis. This article extends that concept with a comprehensive in database solution by introducing a new in database table operator named “CM_Solve”. This approach has value in cases when you want to solve a large number of independent systems of equations or you simply do NOT want to use the R client for solving the system of equations based on the SSCP matrix.
CM_Solve Table Operator
CM_Solve is a user defined table operator that needs to be installed on the system prior to use. It takes as input the output from the CalcMatrix table operator and returns the solution coefficients for the linear model that best fits the input system of equations. The following describes the create DDL and the invocation DML.
CM_Solve DDL
The create DDL is as follows:
REPLACE FUNCTION CM_Solve() RETURNS TABLE VARYING USING FUNCTION cm_solve_contract LANGUAGE CPP NO SQL PARAMETER STYLE SQLTABLE EXTERNAL NAME 'CS!cm_solve!./cm_solve.cpp'
The source code for CM_Solve is in the attached cm_solve.cpp file. The performance of CM_Solve can be enhanced by using the –O3 gcc compiler option. As you can review in the performance testing section CM_Solve is usually only a small component of the overall processing time. Nonetheless in cases when required, to use the compiler options you need to alter the create DDL to use the object code create method. Follow these general steps.
1: compile the source code into an object file on a Linux system with similar OS software version (SLES10, SLES11) as the target Teradata system OS version. Gcc compile command
gcc fpic I /usr/tdbms/etc g O3 c ./cm_solve.cpp
2: Modify the create DDL to specify that the table operator will be created from object code.
REPLACE FUNCTION cm_solve() RETURNS TABLE VARYING USING FUNCTION cm_solve_contract LANGUAGE CPP NO SQL PARAMETER STYLE SQLTABLE EXTERNAL NAME 'CO!cm_solve!/tmp/cm_solve.o'
CM_Solve DML
CM_Solve always operates in a global mode so you must specify a partition attribute, even if there is only a single system of equations to be solved. CM_Solve processes the output of the CalcMatrix LOCAL phase and combines the per AMP matrix subtotals into a single matrix and then uses either the Cholesky or SVD algorithms to solve the system of equations. The input to CM_Solve consists of one or more matrix identifier columns and 1 or more independent variable columns and a single dependent variable column. The matrix identifier column(s) must be placed in the USING MATRIX_IDS using clause and the last input column always designates the dependent variable.
CM_Solve USING clause definitions:
 MATRIX_IDS: A list of one or more input column names that have been used to partition the input data.
 CALCTYPE: The calculation type to be performed, currently only ‘solve’ is supported.
The LOCAL phase of CalcMatrix, USING PHASE('LOCAL'), processes the input data that exists on each AMP and outputs rows of summarized data for that AMP. In the case of multiple system of equations the input to the LOCAL phase must consist of one or more matrix identifier columns and 1 or more independent variable columns and a single dependent variable column. The matrix identifier columns must be placed in the LOCAL ORDER BY list so that the data arrives ordered by system of equations identifiers.
Example invocation with a single system of equations:
SELECT * FROM CM_Solve(  invoke the CM_Solve table operator ON ( SELECT 1 as partno, d1.* FROM CalcMatrix(  invoke the CalcMatrix table operator ON( SELECT * FROM bigcolumns)  Select all of the columns and rows from the input table USING PHASE('LOCAL')  designate the CalcMatrix phase ) as D1 ) PARTITION BY partno  partition by partno, a constant USING MATRIX_IDS('partno')  define the matrix identifier columns CALCTYPE('solve')  define the calctype ) D2
Example invocation with multiple systems of equations:
SELECT * FROM CM_Solve(  invoke the CM_Solve table operator ON ( SELECT * FROM CalcMatrix(  invoke the CalcMatrix table operator ON( SELECT * FROM bigdata)  Select all of the columns and rows from the input table LOCAL ORDER BY partno  order the ON clause data by “partno” column USING PHASE('LOCAL')  designate the CalcMatrix phase ) as D1 ) PARTITION BY partno  partition the ON clause data by “partno” column USING MATRIX_IDS('partno')  define the matrix identifier columns(s) CALCTYPE('solve')  define the calctype ) D2
See appendix A for details on the CM_Solve implementation.
Correctness Testing
To validate the correctness of CalcMatrix and CM_Solve we compared the results to the R software packages. Specifically we used the R LM function result values for the model coefficients and the coefficient of determination. The coefficient of determination, called R Squared, indicates how well the input data fits the model. It provides a measure of how well observed outcomes are described by the model. R squared ranges in value from 0 to 1 with lower values meaning the linear model does not have a good fit for the input data.
We used the R interpreter to control execution of 500 tests with 1 to 500 independent variables with a random input observation count between 1000 and 4000 values. We used the Teradata RODBC package and the Teradata ODBC driver to submit the queries to the Teradata database. When comparing for correctness we used a per value error tolerance of 1e8. For example the following R code fragment would check for an error in the R Squared value with a tolerance of 1e8
DELTA < 1e8 delta < abs ( (RRsquared  TRsquared) / RRsquared) if (delta > DELTA) { print (gettextf("R Squared Error, delta is %1.18f, R=%1.18f, Teradata=%1.18f",delta,RRsquared,TRsquared)) stop() }
An example output for 8 independent variables follows:
[1] "Loop 8, input data set size is 3386 " [1] "R Model  formula (y ~ x1+x2+x3+x4+x5+x6+x7+x8)" [1] "R Squared is 0.001982875583" [1] "R Model coefficients" [1] "4951.983628870" "0.006819239" "0.019203730" "0.028306168" [5] "0.014890097" "0.021459214" "0.010873410" "0.002893754" [9] "0.005675700" [1] "Teradata Model" [1] "R Squared is 0.001982875583" [1] "Teradata Model coefficients" [1] "4951.983628870" "0.006819239" "0.019203730" "0.028306168" [5] "0.014890097" "0.021459214" "0.010873410" "0.002893754" [9] "0.005675700"
Background on calculating R squared on Teradata. Assume the data has dependent variable values yi, each of which has an associated modeled value mi. The values yi are called the observed values and the modeled values mi are the predicted values. Also assume "y bar" is the mean of the y values. The general definition of R Squared, the coefficient of determination is
The SQL representation to calculate R Squared, with the 8 variable test case coefficients substituted in the modeled value expression, would be as follows:
SELECT 1 – // SSres ( SUM((y  (4951.98362886991 + x1 * 0.006819238823883900 + x2 * 0.019203730297631599 + x3 * 0.028306168031164002 + x4 * 0.014890097413664800 + x5 * 0.021459213638044299 + x6 * 0.010873410360648900 + x7 * 0.002893753713620240 + x8 * 0.005675700085017590))**2) / //SStot SUM ((y (SELECT AVG(y) FROM indata8))**2) ) FROM indata8
As a final test we did a visualization comparison between a single independent variable R model and a Teradata model
[1] "Loop 1, input data set size is 2071 " [1] "R Model  formula (y ~ x1)" [1] "R Squared is 0.582239089150" [1] "R Model coefficients" [1] "241.021259378" "0.760719121" [1] "Teradata Model" [1] "R Squared is 0.582239089150" [1] "Teradata Model coefficients" [1] "241.021259378" "0.760719121" [1] "Compare Models ... "
Teradata model regression line in Orange and the R model dashed line in blue is overlaid
Performance Testing
We choose to test two use cases A) a large number of independent systems of equations with a small number of variables and B) a single system of equations with a large number of variables. In both test cases CM_Solve was executed in protected mode.
In test case A we solve 100,000 systems of equations where each system consists of 1000 rows with 24 independent variables. The input table has 100,000,000 rows and the test was performed on a two node 6750 running Teradata database release 14.10.
Input data table DDL:
CREATE TABLE bigdata (partno integer ,x1 float, x2 float, x3 float, x4 float, x5 float, x6 float, x7 float, x8 float, x9 float, x10 float ,x11 float, x12 float, x13 float, x14 float, x15 float, x16 float, x17 float, x18 float, x19 float, x20 float ,x21 float, x22 float, x23 byteint, x24 decimal(18,2) ,y float ) PRIMARY INDEX (partno);
Query:
In the case of multiple system of equations the input to the LOCAL phase must consist of one or more matrix identifier columns (in this case the single partno column) and 1 or more independent variable columns (in this case x1 – x24) and a single dependent variable column (in this case Y). The matrix identifier columns must be placed in the LOCAL ORDER BY list so that the data arrives ordered by system of equations identifiers.
In the case of multiple system of equations the input to CM_Solve consist of a varying matrix identifier columns (in this case the partno column) and 1 or more independent variable columns and a single dependent variable column . The matrix identifier columns must be placed in the USING MATRIX_IDS using clause and the last input column is ways designated the dependent variable.
SELECT * FROM CM_Solve(  invoke the CM_Solve table operator ON ( SELECT * FROM CalcMatrix(  invoke the CalcMatrix table operator ON( SELECT * FROM bigdata)  Select all of the columns and rows from the input table LOCAL ORDER BY partno  order the ON clause data by “partno” column USING PHASE('LOCAL')  designate the CalcMatrix phase ) as D1 ) PARTITION BY partno  partition the ON clause data by “partno” column USING MATRIX_IDS('partno')  define the matrix identifier columns(s) CALCTYPE('solve')  define the calctype ) D2
Explain Text for the query
1)First, we lock a distinct MWW."pseudo table" for read on a RowHash to prevent global deadlock for MWW.bigdata. 2) Next, we lock MWW.bigdata for read. 3) We do an allAMPs RETRIEVE step from MWW.bigdata by way of an allrows scan with no residual conditions into Spool 1 (used to materialize view, derived table, table function or table operator TblOpInputSpool) (all_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1. The size of Spool 1 is estimated with high confidence to be 501 rows (109,218 bytes). The estimated time for this step is 0.09 seconds. 4) We do an allAMPs RETRIEVE step from Spool 1 (Last Use) by way of an allrows scan executing table operator TD_SYSFNLIB.calcmatrix with a condition of ("(1=1)") into Spool 2 (used to materialize view, derived table, table function or table operator D1) (all_amps), which is built locally on the AMPs. The size of Spool 2 is estimated with high confidence to be 501 rows (209,418 bytes). The estimated time for this step is 0.09 seconds. 5) We do an allAMPs RETRIEVE step from Spool 2 (Last Use) by way of an allrows scan into Spool 3 (used to materialize view, derived table, table function or table operator TblOpInputSpool) (all_amps), which is redistributed by hash code to all AMPs. Then we do a SORT to order Spool 3 by the sort key in spool field1. The size of Spool 3 is estimated with high confidence to be 501 rows (213,426 bytes). The estimated time for this step is 0.10 seconds. 6) We do an allAMPs RETRIEVE step from Spool 3 (Last Use) by way of an allrows scan executing table operator MWW.cm_solve with a condition of ("(1=1)") into Spool 4 (used to materialize view, derived table, table function or table operator D2) (all_amps), which is built locally on the AMPs. The size of Spool 4 is estimated with high confidence to be 501 rows (112,725 bytes). The estimated time for this step is 0.09 seconds. 7) We do an allAMPs RETRIEVE step from Spool 4 (Last Use) by way of an allrows scan into Spool 5 (group_amps), which is built locally on the AMPs. The size of Spool 5 is estimated with high confidence to be 501 rows (291,582 bytes). The estimated time for this step is 0.09 seconds. 8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. > The contents of Spool 5 are sent back to the user as the result of statement 1. The total estimated time is 0.47 seconds.
DBQL Performance Data
From the explain you can see that CalcMatrix executes in step 4 and CM_Solve executes in step 6. From the DBQL it is obvious the CalcMatrix dominates the elapsed time and the CPU resource consumption. Also the original input 100,000,000 rows are reduced to 2,500,000 (100,000 systems of 25x25 matrices) rows by CalcMatrix and to the final 100,000 row solution coefficients by CM_Solve.
QueryID 
Elapsed 
sl1 
sl2 
StepName 
CPUTime 
IOcount 
RowCount 


1 
307,182,118,593,413,097 
00:00:00.000000 
1 
0 
MLK 
0.00 
0.00 
1.00 
2 
307,182,118,593,413,097 
00:00:00.000000 
2 
0 
MLK 
0.02 
0.00 
72.00 
3 
307,182,118,593,413,097 
00:00:04.010000 
3 
0 
RET 
225.90 
527,522.00 
100,000,000.00 
4 
307,182,118,593,413,097 
00:00:30.320000 
4 
0 
RET 
1,747.67 
53,747.00 
2,500,000.00 
5 
307,182,118,593,413,097 
00:00:00.260000 
5 
0 
RET 
8.24 
7,213.00 
2,500,000.00 
6 
307,182,118,593,413,097 
00:00:00.190000 
6 
0 
RET 
11.17 
3,468.00 
100,000.00 
7 
307,182,118,593,413,097 
00:00:00.020000 
7 
0 
RET 
0.94 
1,877.00 
100,000.00 
8 
307,182,118,593,413,097 
00:00:00.000000 
8 
0 
Edt 
0.12 
0.00 
72.00 
In test case B we solve a single systems of equations with 1,000,000 rows with 999 independent variables and a single dependent variable. The test was performed on the same two node 6750 running Teradata database release 14.10. It should be noted the computation complexity is proportional to (number of variables)^{2} * number row input rows
Query:
In this case the constant 1 is used as partno for CM_Solve
SELECT * FROM CM_Solve(  invoke the CM_Solve table operator ON ( SELECT 1 as partno, d1.* FROM CalcMatrix(  invoke the CalcMatrix table operator ON( SELECT * FROM bigcolumns)  Select all of the columns and rows from the input table USING PHASE('LOCAL') ) as D1 ) PARTITION BY partno USING MATRIX_IDS('partno') CALCTYPE('solve')  define the calctype ) D2
Explain Text for the query:
1) First, we lock a distinct MWW."pseudo table" for read on a RowHash to prevent global deadlock for MWW.bigcolumns. 2) Next, we lock MWW.bigcolumns for read. 3) We do an allAMPs RETRIEVE step from MWW.bigcolumns by way of an allrows scan with no residual conditions executing table operator TD_SYSFNLIB.calcmatrix with a condition of ("(1=1)") into Spool 2 (used to materialize view, derived table, table function or table operator D1) (all_amps), which is built locally on the AMPs. The size of Spool 2 is estimated with high confidence to be 49,990 rows (810,537,860 bytes). The estimated time for this step is 0.79 seconds. 4) We do an allAMPs RETRIEVE step from Spool 2 (Last Use) by way of an allrows scan into Spool 3 (used to materialize view, derived table, table function or table operator TblOpInputSpool) (all_amps), which is redistributed by hash code to all AMPs. Then we do a SORT to order Spool 3 by the sort key in spool field1. The size of Spool 3 is estimated with high confidence to be 49,990 rows (810,987,770 bytes). The estimated time for this step is 3.29 seconds. 5) We do an allAMPs RETRIEVE step from Spool 3 (Last Use) by way of an allrows scan executing table operator MWW.cm_solve with a condition of ("(1=1)") into Spool 4 (used to materialize view, derived table, table function or table operator D2) (all_amps), which is built locally on the AMPs. The size of Spool 4 is estimated with high confidence to be 49,990 rows (800,939,780 bytes). The estimated time for this step is 0.79 seconds. 6) We do an allAMPs RETRIEVE step from Spool 4 (Last Use) by way of an allrows scan into Spool 5 (group_amps), which is built locally on the AMPs. The size of Spool 5 is estimated with high confidence to be 49,990 rows (2,200,809,750 bytes). The estimated time for this step is 0.79 seconds. 7) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. > The contents of Spool 5 are sent back to the user as the result of statement 1. The total estimated time is 5.67 seconds.
DBQL Performance Data:
From the explain you can see that CalcMatrix executes in step 3 and CM_Solve executes in step 5. From the DBQL it is obvious the CalcMatrix dominates the elapsed time and the CPU resource consumption. Also the original input 1,000,000 rows are reduced to 72,000 (1 systems of 1000x1000 matrix on each of the 72 AMPs) rows by CalcMatrix and to the final 1 row solution coefficients by CM_Solve. Note step 5 will be executed on a single AMP.
QueryID 
Elapsed 
sl1 
sl2 
StepName 
CPUTime 
IOcount 
RowCount 


1 
307,192,118,593,412,145 
00:00:00.000000 
1 
0 
MLK 
0.00 
0.00 
1.00 
2 
307,192,118,593,412,145 
00:00:00.000000 
2 
0 
MLK 
0.02 
0.00 
72.00 
3 
307,192,118,593,412,145 
00:01:24.650000 
3 
0 
RET 
5,324.49 
33,545.00 
72,000.00 
4 
307,192,118,593,412,145 
00:00:01.690000 
4 
0 
RET 
4.10 
7,562.00 
72,000.00 
5 
307,192,118,593,412,145 
00:00:03.550000 
5 
0 
RET 
4.23 
1,572.00 
1.00 
6 
307,192,118,593,412,145 
00:00:00.000000 
6 
0 
RET 
0.04 
236.00 
1.00 
7 
307,192,118,593,412,145 
00:00:00.000000 
7 
0 
Edt 
0.02 
0.00 
72.00 
Append A: CM_Solve algorithm details
See attached file cm_solve_algo.zip.
References:
[2] CalcMatrix: Chapter 26, SQL Functions, Operators, Expressions, and Predicates Release 14.10 B0351145112A May 2013