I’ve been telling you for years to transform your short all-AMP queries into single-AMP queries, whenever you can. I’ve even given you pointers on using stored procedures, join indexes and smart application design to achieve that goal.

But when it comes to random AMP sampling, I’m asking you to ignore all that, and give some thought to converting your random AMP sampling from one AMP to all AMPs.

In this article I’ll give you my perspective on the advantages of spreading this behind-the-scenes statistic-gathering effort across all AMPs in the configuration, and I'll suggest when it can help you. I’m not recommending that everyone, or even most of you, make this change.  But don’t brush it off until you’ve given it a fair listen.  First, here’s a brief refresher on what random AMP samples are.

The optimizer builds an execution plan for each SQL statement that enters the parsing engine. Collecting statistics on the underlying tables and columns ensures that reliable, current demographic information about the objects referenced in the SQL is available to the optimizer, so that those plans perform as well as possible.

If no statistics have been collected, the optimizer will make a rough estimate of a table’s demographics by using dynamic samples from one or more AMPs (one being the default). These samples are collected automatically each time the table header is accessed from disk and are embedded in the table header when it is placed in the dictionary cache.

The AMP that is sampled is determined by the table ID of the table for which demographics are being sought. The assumption behind random AMP sampling is that pulling a small sample from one AMP is fast enough to be unnoticeable, and will provide enough information to generally characterize the table.

Random AMP sampling produces simple, limited demographic information. Most importantly, it comes up with an estimate of the total rows in the table. Secondly, it samples rows from each non-unique secondary index (NUSI) subtable and extrapolates total distinct values (and from that, an estimated number of rows per value) for each NUSI that has been defined on the table. It ignores non-indexed columns completely and does not try to draw sophisticated conclusions. In addition, random AMP sampling doesn’t build interval histograms, as full statistics collection does.

Estimated values derived from random AMP samples are used not only for table row counts, but for other purposes as well, including join cardinality and aggregation estimates.

How Single-AMP Random AMP Sampling Works

Random AMP sampling is the fastest of all the statistics collection options. One cylinder index on the targeted AMP is examined, as opposed to reading base table rows directly. The cylinder index carries a row count for each of the table’s data blocks, as well as the number of data blocks in the cylinder. These counts, along with a count of the number of cylinders that the table spans, are used to extrapolate the number of rows for that table on that AMP. The estimate of row counts for the entire table is determined to be equal to the number of rows from the sampled AMP multiplied by the number of AMPs in the configuration.
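The extrapolation described above can be sketched in a few lines. This is an illustrative simplification, not Teradata internals: it treats the cylinder-index counts as uniform averages, and every name in it is hypothetical.

```python
# Hypothetical sketch of single-AMP row count extrapolation: the cylinder
# index supplies block and row counts, so no base-table rows are read.

def single_amp_estimate(rows_per_block, blocks_per_cylinder, cylinders, num_amps):
    """Estimate total table rows from one AMP's cylinder-index counts."""
    # Rows held by the sampled AMP, derived from cylinder-index metadata.
    rows_on_amp = rows_per_block * blocks_per_cylinder * cylinders
    # The estimate assumes every AMP holds about the same number of rows.
    return rows_on_amp * num_amps

# A well-distributed table: the sampled AMP holds ~1/100th of the rows.
estimate = single_amp_estimate(rows_per_block=500, blocks_per_cylinder=20,
                               cylinders=10, num_amps=100)
print(estimate)  # 10000000
```

The multiply-by-number-of-AMPs step is exactly where the even-distribution assumption enters, which is why skewed primary indexes undermine the single-AMP approach.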

Statistics derived from random AMP sampling are never kept permanently in data dictionary tables. Rather they are held within the table header in the dictionary cache until pushed out by something more current that needs cache space, or until the normal 4-hour purge process, or until the dictionary cache is spoiled for that table. As a result, as changes are made to the underlying data, the sampling will always be reasonably fresh.

The All-AMPs Option

A recent enhancement to random AMP sampling allows more than one AMP to be sampled. By changing an internal field in the DBSControl record called RandomAMPSampling, you can request that sampling be performed on 2 AMPs, 5 AMPs, all AMPs on a node, or all AMPs in the system.

When using these options, random sampling uses the same techniques as single-AMP random AMP sampling, but more AMPs participate. I don’t advise you to consider 2- or 5-AMP random AMP sampling, because in most cases the added overhead will not provide enough value. All-AMP sampling is the preferred approach if you desire a broader scope, particularly when the primary index has an uneven distribution of values.

All-AMP random AMP sampling has been improved in Teradata 12 in a way that significantly reduces message overhead. Rather than relying on multiple messages and all-AMP aggregations, all-AMP sampling uses the “Last Done” channel mechanism. A channel is a software construct associated with the BYNET that simplifies communication among all the resources working on the same job. Channels replace messaging with signaling, and allow the database to accumulate simple information with lower latencies and minimal network traffic.

Using this technique, the parsing engine (PE) makes a single broadcast to all AMPs. Each AMP polls its results on the channel, after performing the basic “single-AMP sampling” routine that is described above. The last AMP to post a response sends a message with the results back to the PE.
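The "Last Done" pattern above can be loosely illustrated with ordinary threads and a lock — this is a conceptual sketch, not BYNET channel code, and all names in it are made up.

```python
# Loose illustration of the "Last Done" pattern: every AMP adds its local
# sample to a shared accumulator, and only the last AMP to finish sends
# the combined result back to the PE (here, a callback stands in for that).

import threading

class LastDoneChannel:
    def __init__(self, num_amps, on_last_done):
        self.lock = threading.Lock()
        self.remaining = num_amps
        self.total_rows = 0
        self.on_last_done = on_last_done   # stand-in for the reply to the PE

    def post(self, local_rows):
        with self.lock:
            self.total_rows += local_rows
            self.remaining -= 1
            if self.remaining == 0:        # this AMP was the last one done
                self.on_last_done(self.total_rows)

results = []
channel = LastDoneChannel(num_amps=4, on_last_done=results.append)

# Each "AMP" runs its local sampling routine, then posts its count.
threads = [threading.Thread(target=channel.post, args=(rows,))
           for rows in (2500, 2400, 2600, 2500)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # [10000] - one combined reply, from whichever AMP finished last
```

The point of the pattern is that the PE sends one broadcast and receives one reply, regardless of how many AMPs participate.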

When Will it Help You?

There are two key advantages of all-AMP random samples:

  1. They are better equipped to detect skew in the table’s primary index. Using all AMPs, the quality of the random sample goes up, and a more accurate row count estimate for the table is possible when rows are not evenly distributed across AMPs. A better row count estimate can improve query plans and plan estimates when full statistics are not yet in place or have not been collected.
  2. The statistics extrapolation feature of Teradata 12 will be applied more effectively for small tables or for tables whose primary index is skewed, if all-AMP random AMP sampling is turned on. An accurate row count is critical for detecting stale statistics, and a sample from a single AMP in the face of skew may be incomplete.  
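A toy comparison makes the skew point concrete. The numbers below are invented: single-AMP extrapolation multiplies one AMP's count by the number of AMPs, while all-AMP sampling sums the actual per-AMP counts.

```python
# Hypothetical skewed table: AMP 0 holds most of the rows, the other
# nine AMPs hold very few.
rows_per_amp = [9000] + [100] * 9
actual_total = sum(rows_per_amp)                          # 9,900 rows

# Single-AMP sampling extrapolates from whichever AMP it lands on.
single_amp_estimate = rows_per_amp[0] * len(rows_per_amp)  # 90,000 if AMP 0 is sampled

# All-AMP sampling accumulates every AMP's count, so skew cannot mislead it.
all_amp_estimate = sum(rows_per_amp)

print(actual_total, single_amp_estimate, all_amp_estimate)  # 9900 90000 9900
```

Had the single-AMP sample landed on one of the lightly loaded AMPs instead, the estimate would have been 1,000 rows — an error in the other direction, equally capable of misleading the optimizer.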

All-AMP random AMP sampling is on by default for the following constructs, in order to get more accurate dynamic statistical information in the absence of full statistics collection.

  • Volatile tables
  • Single table sparse join indexes
  • Single table aggregate join indexes

And It Doesn’t Cost Much

In a recent engineering investigation that compared the time taken by single-AMP random AMP sampling against the all-AMP activity, all-AMP random AMP samples were found to add only about 15% to the elapsed time.  Admittedly, the testing was performed on a relatively small configuration.  While the time for all-AMP sampling is not expected to increase much beyond this 15% on your system, a configuration with a very large number of AMPs could experience increased overhead.  Keep in mind that even though more work is required when more AMPs are present, the sampling is performed in parallel across all AMPs.

Although it comes with slightly more effort, the results of all-AMP sampling will remain in the dictionary cache for up to 4 hours, just as is true for single-AMP sampling. Random AMP sampling of any kind is not expected to happen frequently, and its results are cached and shared among many plan-producing queries that arrive on the same PE.

vincent91 20 comments Joined 02/10
12 Feb 2010

Thanks for your interesting article.

I'm interested in the COLLECT STAT USING SAMPLE and I need your help to make things clear in my mind.

1/ In the TD User Doc V12, I read the following in Chapter 8:
"When sampling, you need not specify the percentage of rows of the table to sample. By default,
Teradata Database will begin with a 2% sample. If, after generating statistics for that sample,
Teradata Database detects any skewing, it will increase the sample bit by bit to as high as 50%
until it determines that the sample percent is in line with the observed skewing".
My question is :
If I send the order "COLLECT STATS USING SAMPLE ON mytable COLUMN cod_agc;",
how can I know afterwards the percentage of rows (between 2% and 50%) used for collecting stats on my table?

Currently I'm testing the collect stat using sample and this information would give me an indication on a very skewed table.

2/ I put the internal field RandomAmpSampling = A (all AMPs in the system) because I want to use all-AMP sampling.

But I'm wondering about the role of the other internal field, RowsSampling, in collect stat using sample. I don't have any idea about the appropriate value to put in RowsSampling. Honestly, I don't know if this parameter plays a part in the all-AMP random samples.
Please, can you give me your advice on it ?

Thank you for your help.
Vincent from France

carrie 595 comments Joined 04/08
15 Feb 2010

Hello Vincent,

There are a couple of questions here, so let me address them one at a time.

1. USING SAMPLE - The percentage of rows read with the USING SAMPLE option will not increase for non-PI columns. Only with NUPI columns will the sampling percent be gradually increased if skew is detected in that column during the stats collection process. If this were to happen to you, which is only the case if you are collecting stats on a NUPI, then you can tell what percent of the rows were actually read by looking at the SampledPercent field in Interval zero of the histogram that is produced.

2. RandomAmpSampling and RowsSampling fields - I'm not sure I discussed the tradeoffs of all-AMP random AMP sampling in a balanced enough way in the article you are commenting on. So I'm glad you added this comment. The thing to be aware of with the RandomAMPSampling field is that if you have a large number of AMPs (high hundreds, thousands), more extensive sampling could have an impact on query parsing times. It may not, but it could. So I would advise anyone changing that parameter to keep an eye on that. After you change the parameter, check to see if you are experiencing longer times in the PE for your queries as a result of the change (FirstStepTime - StartTime). If that becomes problematic, reverse back to single-AMP sampling.

RowsSampling allows you to increase the total number of rows to be sampled on each AMP when random AMP sampling takes place. I don't have any guidelines for you on when this would be good to change. But the tradeoff here is similar: increasing RowsSampling could cause more work at the PE level (more I/Os). Since it sounds like you have already changed to all-AMP random AMP sampling, I'd stop there and see how that works for you without making additional changes in this area. If you are at all confused about these settings and how they apply to you, work with the support center and follow their recommendations as to whether changing any internal parameter is appropriate for your particular platform.

vincent91 20 comments Joined 02/10
19 Feb 2010

Thank you Carrie for this valuable information. I regret that there is nothing in the Teradata Doc V12 on this.

I did a test and the percentage of rows read increased with PI (UPI & NUPI) columns and, surprisingly, with SI (USI & NUSI) columns too.

I called my support center and they did not recommend changing RandomAmpSampling and RowsSampling because it "may be adding overhead to the parsing of ALL queries....".
I'm a little bit sceptical - how could these 2 parameters for collecting stats impact QUERIES?

carrie 595 comments Joined 04/08
22 Feb 2010

Thank you for sharing your experiences.

I would recommend that you always follow the advice of the support center.

The reason that turning on all-AMP sampling could impact query performance is based on the fact that there is some additional effort, and possibly time, to perform the random AMP sample activity during the parsing phase. You would be accessing all AMPs instead of one AMP. This would be a concern on a system with a large number of AMPs more than on a system with a low number of AMPs. This could increase parsing time.

KeithJones 5 comments Joined 07/10
15 Sep 2010

Carrie, the manual implies that the RandomAMPSampling dbscontrol setting not only affects the scope of the random amp sampling when no collected statistics exist, but also the scope of the COLLECT STATISTICS USING SAMPLE statement. Is that true, or does the COLLECT statement always use all amps?

carrie 595 comments Joined 04/08
17 Sep 2010

Hi Keith,

The COLLECT STATISTICS statement will always engage all AMPs, whether or not USING SAMPLE has been specified. If USING SAMPLE is executed and the default 2% sampling rate is used, then each AMP will access 2% of its data blocks. Only random AMP sampling has the option of being a single AMP access or multiple AMP access.

Thanks, -Carrie

tintin 1 comment Joined 01/12
27 Feb 2012

Hi Carrie,

I have a doubt. I believe random AMP sampling is chosen for indexed columns only [indexed selection criterion with no non-equality join], and for non-indexed columns the optimizer uses heuristics. But as per your description above, it seems that when statistics are unavailable or stale, the optimizer uses random AMP sample statistics irrespective of whether the selection criterion is indexed or non-indexed.
Kindly clarify

carrie 595 comments Joined 04/08
28 Feb 2012

There are two different types of information produced by random AMP sampling:

1.) An estimate of the total rows in the table
2.) Extrapolation of total distinct values for each NUSI on the table.

So you are correct that random AMP sampling only provides detailed information (number of distinct values) for NUSI columns. However, the row count estimate that comes out of random AMP sampling (#1 above) is used as a base upon which the heuristics are applied.

If the heuristic says 10% of the rows in a table will qualify based on a predicate that does not have stats and is not a NUSI, the random AMP sample provides the information to tell the optimizer "10% of what?" It provides the current estimated row count of the table.
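The arithmetic behind the "10% of what?" point is simple enough to show directly — the numbers below are invented for illustration.

```python
# Toy arithmetic: the heuristic supplies the selectivity, and the random
# AMP sample supplies the base row count it is applied to.

estimated_table_rows = 1_000_000    # from the random AMP sample
heuristic_selectivity = 0.10        # no-stats, non-NUSI equality predicate

estimated_qualifying_rows = int(estimated_table_rows * heuristic_selectivity)
print(estimated_qualifying_rows)  # 100000
```

If the sampled row count is badly wrong (for example, because of a skewed primary index and single-AMP sampling), every heuristic estimate built on top of it inherits that error.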

Thanks, -Carrie

MarkVYoung 20 comments Joined 03/12
03 Jul 2013

Hi Carrie,
We are investigating moving from Single AMP Sampling to AllAMPSampling and I am trying to work out a method to allow us to do this easily!
Are you aware of the views provided by Dieter?
If so, could we look at the variation between what is returned in the OneAMPSampleEst and the AllAMPSampleEst compared with the NumRows, to see how often there is a large discrepancy, and make that determination?
If that might be the case, could we also use these Est(imate) columns to determine whether we even need the stats we have collected, if all these values are within a small percentage of each other. Two examples shown below:
NumRows               OneAMPEst            ALLAMPEst            CollectDuration 
1,608,977,669       1,618,420,224       1,610,245,524       00:55:56.31
   578,902,658       1,317,215,544       1,287,298,808       00:01:40.32
So in the first case, we probably don't need to collect stats on this column at all, as both OneAMP and AllAMP estimates match the NumRows. In the second case, it would seem that if we used ALLAMPSampling, then we would be providing better statistics to the optimiser!
Are my assumptions correct, or do these values mean something else entirely?

carrie 595 comments Joined 04/08
08 Jul 2013

Yes, you can see whether all-AMP random AMP sampling and single-AMP random AMP sampling produce close to the same result by comparing them both. All-AMP sampling can only be turned on in DBS Control, so you could get the estimates from the explain text of a query that scans the table, both before and after changing that parameter.
Also, since turning on all-AMP sampling is a DBS Control setting, all tables in the database will be impacted. You cannot decide to use all-AMP sampling on a table-by-table basis.
Random sampling, whether single-AMP or all-AMP, has nothing to do with column-level statistics. Here is when random AMP sampling takes place:
If no statistics have been collected, the optimizer will make a rough estimate of a table’s demographics by using dynamic samples from one or more AMPs (one being the default). These samples are collected automatically each time the table header is accessed from disk and are embedded in the table header when it is placed in the dictionary cache.
Random AMP sampling produces simple, limited demographic information. First, it comes up with an estimate of the total rows in the table. Secondly, it samples rows from each non-unique secondary index (NUSI) subtable and extrapolates total distinct values (and from that, an estimated number of rows per value) for each NUSI that has been defined on the table. It ignores non-indexed columns completely and does not try to draw sophisticated conclusions.
Thanks, -Carrie

mohdmufid4u 4 comments Joined 10/10
24 Sep 2013

Hi Carrie,
I have a table which has a column say X whose cardinality is very low, for example consider a table with billions of records having a column named src_syst_c.
The column src_syst_c is having 50 unique values.
However, random AMP sampling on this column can give me any number, higher or lower - what are the pros and cons of relying on random AMP sample statistics for these columns with low cardinality on big tables?

mohdmufid4u 4 comments Joined 10/10
24 Sep 2013

I got my answers in another of your blog posts.. thanks
