# Can We Speak Confidentially? Exposing Explain Confidence Levels

I gave a presentation at the Teradata Partners Conference last week on the fundamentals of collecting statistics, where I touched briefly on the rules behind confidence levels. I’m using this article to go into further depth, and offer you more examples of how these confidence levels come about. If you look at query plans much, I’m *confident* you’ll find something of interest in this article.

##
**The Mandatory Background Section (Skip if you’re Teradata-Knowledgeable)**

Confidence levels are determined by the optimizer at the time a query plan is built as an indication of how comfortable the optimizer is by the estimates it comes up with for a given step. High Confidence being the greatest comfort level, and No Confidence being the least.

Some steps don’t give you a confidence level because optimizing the step is a no-brainer. Below is an example of single-AMP using primary index access. Since access by primary index is automatic when the query provides a primary index value, no costing decisions need to be made and confidence levels are immaterial.

Here is an example of an all-AMP access step from a query plan that illustrate a confidence level typical of an all-AMP step:

EXPLAIN SELECT * FROM customer WHERE c_nationkey = 12;

3) We do an all-AMPs RETRIEVE step from CAB30G.customer by way of an

all-rows scan with a condition of ("CAB30G.customer.C_NATIONKEY =

12") into Spool 1 (group_amps), which is built locally on the AMPs.

The size of Spool 1 is estimated with **high confidence **to be

179,308 rows (20,620,420 bytes). The estimated time for this step

is 0.53 seconds.

Here is an example of a primary index access step with a missing confidence level:

EXPLAIN SELECT * FROM customer WHERE c_custkey = 73454;

1) First, we do a single-AMP RETRIEVE step from CAB30G.customer by

way of the unique primary index "CAB30G.customer.C_CUSTKEY =

73454" with no residual conditions. The estimated time for this

step is 0.00 seconds.

##
**The Confidence Levels**

There are 4 confidence levels for the optimizer to select from. The following graphic illustrates the pre-requisites for receiving each of these levels:

###
**High Confidence**

As shown above, primary index statistics must have been collected on the table being accessed in the step in order to receive High Confidence for that step. Further, there must only be a single predicate (single or multicolumn) with statistics collected on that predicate. There are two example examples below that illustrate this requirement.

For the two following examples, statistics were collected on the following:

- T_Txnkey (primary index)
- T_Quantity (column)
- T_Shipdate (column)

The following explain shows that High Confidence results when primary index stats have been collected AND there is a single predicate:

EXPLAIN SELECT * FROM txn WHERE t_quantity = 5;

3) We do an all-AMPs RETRIEVE…with a condition of ("CAB.txn.

T_QUANTITY = 5.00") into Spool 1…The size of Spool 1 is

estimated with **high confidence **to be 199,878 rows…

This next example illustrates that even with statistics collected on all referenced columns, Low Confidence is returned if there is more than one predicate on the table:

EXPLAIN SELECT * FROM txn WHERE t_quantity = 5 AND t_shipdate = 980101

3) We do an all-AMPs RETRIEVE … with a condition of

("(CAB.txn.T_SHIPDATE = DATE '1998-01-01') AND

(CAB.txn.T_QUANTITY = 5.00)") into Spool 1…The size of

Spool 1 is estimated with **low confidence **to be 3,106 rows. . .

If multicolumn stats had been collected on (T_Quantity,T_Shipdate), then the step would have qualified for High Confidence, since a multicolumn statistic is considered a single predicate. Note that for a multicolumn statistics to be used by the optimizer, all participating columns must be expressed in equality conditions within the query.

###
**Low Confidence**

For the vast majority of cases, the optimizer will cost a Low Confidence step the same as a High Confidence step. So there is little, if any, value in going to extreme lengths to achieve High Confidence steps. By the way, statistics are considered “collected” whether they are fully collected are or you have collected them taking advantage of the USING SAMPLE option. I just now ran the explain of the first query in this article after re-collecting statistics on c_nationkey specifying USING SAMPLE with a 2% sampling rate. I still received High Confidence for the step.

###
**Low Confidence and Join Steps**

Practically all join steps will give you Low Confidence. This is because the optimizer is not aware of what the match will be between the tables being joined. Below is a graphic that illustrates this point.

There is an exception to this. Under special circumstances a join step can receive High Confidence. The requirement is that a unique primary index is on one side of the join with stats collected, and there is a predicate in the WHERE clause that passes a single value for this unique primary index. An example follows:

EXPLAIN SELECT * FROM txn,part WHERE t_partkey = p_partkey <== UPI with stats, 1 row guaranteed AND p_partkey = 88;

3) We do an all-AMPs RETRIEVE step from CAB.txn by way of index # 4

"CAB.txn.T_PARTKEY = 88". . . Spool 2 is estimated with **high confidence**

to be 3 rows. . .

4) We do a single-AMP JOIN step from CAB.part by way of the

unique primary index "CAB.part.P_PARTKEY = 88“. . .joined to

Spool 2 (Last Use). . . The size of Spool 1 is estimated with** high confidence **to be 3 rows. . .

###

**Low Confidence and NUSI Selection with No Stats**

If selection on the table is by NUSI, even though there are no statistics collected on the NUSI, Low Confidence will result . However, there must only be a single NUSI used as a predicate. If there are multiple NUSI’s referenced, No Confidence is returned. Below are two examples that illustrate this:

EXPLAIN SELECT * FROM parttbl WHERE p_size = 5 ; <== NUSI but no statistics

3) We do an all-AMPs RETRIEVE step. . . by way of

index # 4 "CAB.parttbl.P_SIZE = 5" . . . Spool 1 is estimated with **low
confidence **to be 103,487 rows.

EXPLAIN SELECT * FROM parttbl WHERE p_size = 5 <== NUSI but no statistics AND p_type NE 'small plated tin‘; <== NUSI but no statistics

3) We do an all-AMPs RETRIEVE … by way of

index # 4 "CAB.parttbl.P_SIZE = 5" with a residual condition

of ("CAB.parttbl.P_TYPE <> 'small plated tin'") …

Spool 1 is estimated with **no confidence **to be 92,342 rows .

The reasoning here is that random AMP samples provide basic demographic information about all NUSIs on a table. Since that information is available even without statistics collection on the NUSI, a Low Confidence is acceptable for such single NUSI access.

##
**Trickle-Down Effect of No Confidence**

If you’ve got complex, multi-step plans, chances are when you look at the steps at the bottom of the plan, most of them display No Confidence. One contributing factor is that lower confidence levels coming from previous steps will dominate the confidence level of the steps they feeds into.

Below is a query and its explain that illustrate this point:

In this example, full statistics on all indexes and referenced columns have been collected on both the Lineitem and the Parttbl tables. However no statistics at all have been collected on the Supplier table.

The following table illustrates the confidence levels at each step of this 3 table join. Note that even though the Parttbl has all possible stats collected, the step that joins the Parttbl to the spool created by the join of the Lineitem and Supplier table shows No Confidence. That is because the step feeding into it has No Confidence because no stats had been collected on Supplier.

Here is the entire explain from which information in the above table was drawn.

EXPLAIN SELECT * FROM lineitem,parttbl,supplier WHERE p_partkey = l_partkey AND s_suppkey = l_suppkey AND p_size = 3 AND l_shipdate = 950101 AND s_nationkey = 6;

1) First, we lock a distinct CAB."pseudo table" for read on a

RowHash to prevent global deadlock for CAB.supplier.

2) Next, we lock a distinct CAB."pseudo table" for read on a

RowHash to prevent global deadlock for CAB.parttbl.

3) We lock a distinct CAB."pseudo table" for read on a RowHash to

prevent global deadlock for CAB.lineitem.

4) We lock CAB.supplier for read, we lock CAB.parttbl for

read, and we lock CAB.lineitem for read.

5) We do an all-AMPs RETRIEVE step from CAB.lineitem by way of an

all-rows scan with a condition of ("CAB.lineitem.L_SHIPDATE =

DATE '1995-01-01'") into Spool 2 (all_amps) (compressed columns

allowed), which is redistributed by the hash code of (

CAB.lineitem.L_SUPPKEY) to all AMPs. Then we do a SORT to

order Spool 2 by row hash. The size of Spool 2 is estimated with

high confidence to be 74,730 rows (9,415,980 bytes). The

estimated time for this step is 16.61 seconds.

6) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a

RowHash match scan, which is joined to CAB.supplier by way of

a RowHash match scan with a condition of (

"CAB.supplier.S_NATIONKEY = 6"). Spool 2 and CAB.supplier

are joined using a merge join, with a join condition of (

"CAB.supplier.S_SUPPKEY = L_SUPPKEY"). The result goes into

Spool 3 (all_amps) (compressed columns allowed), which is

redistributed by the hash code of (CAB.lineitem.L_PARTKEY) to

all AMPs. Then we do a SORT to order Spool 3 by row hash. The

size of Spool 3 is estimated with no confidence to be 13,105 rows

(2,738,945 bytes). The estimated time for this step is 0.06

seconds.

7) We do an all-AMPs JOIN step from CAB.parttbl by way of a

RowHash match scan with a condition of ("CAB.parttbl.P_SIZE =

3"), which is joined to Spool 3 (Last Use) by way of a RowHash

match scan. CAB.parttbl and Spool 3 are joined using a merge

join, with a join condition of ("CAB.parttbl.P_PARTKEY =

L_PARTKEY"). The result goes into Spool 1 (group_amps), which is

built locally on the AMPs. The size of Spool 1 is estimated with

no confidence to be 13,105 rows (3,826,660 bytes). The estimated

time for this step is 0.47 seconds.

8) Finally, we send out an END TRANSACTION step to all AMPs involved

in processing the request.

##
**Conclusion**

Confidence Levels are associated to query plan steps in an effort to assess the confidence the optimizer has in the accuracy of the demographic information available for thatstep. For the most part, High Confidence and Low Confidence steps will be treated the same by the optimizer when it comes to making plan decisions. Even with good statistics collected, there will be cases where Low Confidence is delivered, especially for steps near the bottom of a complex query plan.

Dear Carrie

Allow me to be bit selfish in asking you an off-the-track but a related question.

While collecting statistics on a table,I got "out of spool space " error,so much so that it resulted in mini cyl-pack alert. I took off the session but what's the solution for such a situation .Why I got spool error while collecting stats?

Does that mean I cannot collect stats even on that table?

Looking for your valued insight.