m.tahoon — 28 Feb 2013
Sampled Stats vs Time trade-off

Our goal is to minimize statistics collection time. We currently choose between full and sampled stats based on table size:

- SAMPLE 50 percent for all of a table's statistics combinations when the table has more than 5 billion rows
- FULL for all of a table's statistics combinations when the table has fewer than 5 billion rows

Since this still takes a long time, we came across an idea to reduce it further: decide full vs. sample not by the table's total row count, but by the number of values in each statistics combination. For example:

- If the table has 2 billion rows but the number of values for the combination (col1, col2) is 10 billion, use sampled stats for that column combination specifically.
- If the table has 10 billion rows but the number of values for the combination (col3, col4) is 2 billion, do not use sampled stats for that column combination specifically.

Is this approach logical, or should we stick to the table's total row count?

dnoeth — 28 Feb 2013
Did you read Carrie Ballinger's blogs on statistics recommendations for your release (13/13.10/14)? For some really large tables there's no other way than sample stats. You just have to be careful when a column is correlated to a partitioning column, e.g. partitioned by date with another column year_month. Never collect sample stats on year_month without also including the date, because sampling just starts to read the first blocks, and a partitioned table is sorted by partition number...

Dieter

m.tahoon — 03 Mar 2013
Thanks Dieter. Carrie's blog is very useful, and leaving date columns unsampled is on our plan now. I've noticed that when I reduce the sample size, the number of values computed decreases, which is to be expected. My concern about the point you mentioned: the sample always takes the first x percent of blocks, which can be the first/last blocks inserted.
For columns with few distinct values (say, only 10 values in the column), a bad plan can occur as soon as one of those 10 values is missed by the sampled stats and is then used in a predicate; that is what the accuracy vs. stats-time trade-off is all about. My question: is there any way to change how the sample is collected, other than picking the first blocks, e.g. random blocks for the same percentage?

dnoeth — 04 Mar 2013
For a table with a PI but no partitioning, the sampling approach is usually OK: the table is sorted by RowHash, i.e. not by insert order, so it looks quite random (for a NoPI table it *is* chronological order). When you SAMPLE on the partitioning column of a PPI table, the optimizer is smart enough to know that and samples the first x percent from *each* partition. But there's no way to change the optimizer's strategy (although it would be nice to have, regarding the problem with correlated dates).

Dieter
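The two decision rules discussed in the thread can be sketched as follows. This is a minimal illustration of the logic only, not Teradata syntax; the function names and the 5-billion threshold are taken from the post's examples.

```python
# Sketch of the two rules from the thread (names are illustrative).

THRESHOLD = 5_000_000_000  # 5 billion, per the post

def current_rule(table_rows: int) -> str:
    """Original approach: decide by the table's total row count."""
    return "SAMPLE" if table_rows > THRESHOLD else "FULL"

def proposed_rule(combo_values: int) -> str:
    """Proposed approach: decide per statistics combination, by its
    number of values, ignoring the table's total row count."""
    return "SAMPLE" if combo_values > THRESHOLD else "FULL"

# 2 billion rows, 10 billion values for (col1, col2):
print(current_rule(2_000_000_000), proposed_rule(10_000_000_000))   # FULL SAMPLE
# 10 billion rows, 2 billion values for (col3, col4):
print(current_rule(10_000_000_000), proposed_rule(2_000_000_000))   # SAMPLE FULL
```

The two rules disagree exactly in the poster's examples: the per-combination rule samples a high-value-count combination on a small table and fully collects a low-value-count combination on a huge table.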
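Dieter's warning about columns correlated with the partitioning column can be illustrated with a toy simulation. This is pure Python, not Teradata internals; the table layout (360 daily rows, physically ordered by date, with a derived year_month column) is invented for illustration.

```python
# Toy simulation of first-block sampling bias (not Teradata internals):
# a table physically ordered by date, with a year_month column derived
# from (and therefore correlated with) that date.

rows = [(day, day // 30) for day in range(360)]   # (date, year_month), sorted by date

# "Sampling just starts to read the first blocks": take the first 10%.
first_blocks = rows[: len(rows) // 10]

sampled_months = {ym for _, ym in first_blocks}
actual_months = {ym for _, ym in rows}

# The sample sees only the earliest months, so a distinct-value estimate
# extrapolated from it badly undercounts year_month.
print(len(sampled_months), "of", len(actual_months), "months seen")   # 2 of 12 months seen
```

A truly random 10% sample would almost certainly touch every month, which is why including the date column (or avoiding sampling on such columns) matters when the physical order correlates with the column's values.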