All Forums Database
prakash_r02 8 posts Joined 08/10
07 Sep 2010
Collect Stats - How Does Performance Get Improved?

From Books:

When you collect stats on fields/indexes, the system gathers information such as: the total row count of the table, how many distinct values there are in the column, how many rows exist per value, whether the column is indexed, and if so whether the index is unique or non-unique.

This information is known as statistics.
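For reference, the basic syntax looks like this (MyDB/MyTable/MyColumn are just placeholders; the available options vary by Teradata release):

COLLECT STATISTICS ON MyDB.MyTable COLUMN (MyColumn);
COLLECT STATISTICS ON MyDB.MyTable INDEX (MyIndexColumn);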
___________________________________________________________________________
Example:
Table "Emp_Details" has the primary index "Emp_Name". I use this index "Emp_Name" in my joining condition/Where clause.

Case 1: I don't have collected stats on the "Emp_Details" table.

sel * from Emp_Details
where Emp_Name = 1000;

The above query performs well, as I used the primary index in the WHERE clause.

Case 2: I have collected stats on the "Emp_Details" table, on the index "Emp_Name".

sel * from Emp_Details
where Emp_Name = 1000;

Most of us say the above query performs better than the one in Case 1.
___________________________________________________________________________
Questions:

1. Where are these statistics stored?

2. How does the Parsing Engine use the statistics to improve the performance of a query?

3. Why do I need to collect stats on indexed columns, given that using indexed columns in the WHERE clause/joins already gives better performance?

Karam 75 posts Joined 07/09
09 Sep 2010

1. Where are these statistics stored?

Statistics are stored in the DBC data dictionary and are exposed through DBC views such as ColumnStats, IndexStats, and MultiColumnStats.

2. How does the Parsing Engine use the statistics to improve the performance of a query?

Statistics provide detailed information, including an exact row count as of the time the statistics were gathered. Therefore, when the Optimizer (Parsing Engine) finds that there are statistics available for a referenced table, it will use those statistics instead of estimated table cardinality or estimated unique index value counts (that is, it will trust the statistics).
With such information, the Optimizer is better prepared to decide the fastest way to execute the query. This results in improved plans, and the row estimates shown in EXPLAIN come closer to reality.
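You can see the effect in the confidence phrases of an EXPLAIN, using the Emp_Details example from the original post (the comments paraphrase typical plan text, they are not verbatim output):

EXPLAIN SELECT * FROM Emp_Details WHERE Emp_Name = 1000;

-- Without stats, plan steps typically read "... with no confidence".
-- After COLLECT STATISTICS on Emp_Name they typically read
-- "... with high confidence", reflecting the exact demographics.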

3. Why do I need to collect stats on indexed columns, given that using indexed columns in the WHERE clause/joins already gives better performance?

It depends; a few of the recommendations are as follows:

a) You should always collect stats on NUSIs. The Optimizer has little chance of using your indexes properly without stats.

b) Collect stats on the UPI of tables with fewer than 1,000 rows per AMP.

As said, read these as recommendations rather than hard and fast rules (a sketch of both follows).
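A minimal sketch of both recommendations (table and column names are hypothetical):

-- a) stats on a NUSI
COLLECT STATISTICS ON MyDB.Orders INDEX (Order_Status);

-- b) stats on the UPI of a small table (fewer than ~1,000 rows per AMP)
COLLECT STATISTICS ON MyDB.Country_Codes COLUMN (Country_Code);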

prakash_r02 8 posts Joined 08/10
10 Sep 2010

Karam, thanks for your answers. I still have the following questions.
-------
"1. Where are these statistics stored?

Statistics are stored in DBC views like ColumnStats,IndexStats and MultiColumnStats "

Yes, statistics information such as the column/index, the unique values, and the date & time of the stats collection is stored in the DBC tables you specified.
-------
Where is the following information stored?
1. How many rows per value?
2. Is the column indexed, and if so, unique or non-unique?

Does the Parsing Engine refer to the DBC tables to get the stats information?
------

"when the Optimizer (Parsing Engine) finds that there are statistics available for a referenced table, it will use those statistics instead of using estimated table cardinality or estimated unique index value counts (that is, it will trust the statistics)
With such information, optimizer is better prepared to decide the fastest way of query execution."
------
I have a question on the above answer.

For instance, if I use the primary index in a join/WHERE clause, the Optimizer passes the value through the hashing algorithm, which is its way of finding the best path to the data.
How does the Optimizer select the best path with the help of collected stats information such as the number of unique values and the rows per value?
---------
3. Why do I need to collect stats on indexed columns, given that using indexed columns in the WHERE clause/joins already gives better performance?

It depends; a few of the recommendations are as follows:

a) You should always collect stats on NUSIs. The Optimizer has little chance of using your indexes properly without stats.

b) Collect stats on the UPI of tables with fewer than 1,000 rows per AMP.

As said, read these as recommendations rather than hard and fast rules.
------
Could you please explain the above answers to my 3rd question in more detail?

Thanks in advance.

Karam 75 posts Joined 07/09
12 Sep 2010

Where is the following information stored?
1. How many rows per value?
2. Is the column indexed, and if so, unique or non-unique?
Does the Parsing Engine refer to the DBC tables to get the stats information?
----------
Ans: Data demographic information is stored in DBC tables. Refer to the Teradata SQL Reference: Statement and Transaction Processing manual for details.
For stats info, the PE reads either cache or disk, depending on the table's usage in recent queries.
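A quick way to see the collected demographics without querying the dictionary tables directly (the table name is a placeholder):

HELP STATISTICS MyDB.MyTable;

-- Typically returns the collection date/time, the number of unique values,
-- and the column/index name for each statistic on the table.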
As for your other question on hashing, my take is:
Hashing is purely used for data placement, depending on PI value uniqueness. It can point to the exact AMP that stores a particular PI value, whereas statistics provide precise data demographic information to the cost-based Optimizer, such as (see the sketch below):
Number of rows in the table
Number of rows per value
Row size
Column demographics
Index demographics
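A small illustration of the hashing side, using Teradata's hash functions and the Emp_Details example from this thread:

-- Which AMP would own the row for this PI value?
SELECT HASHAMP(HASHBUCKET(HASHROW(1000)));

-- Row distribution per AMP for an existing table:
SELECT HASHAMP(HASHBUCKET(HASHROW(Emp_Name))) AS amp_no, COUNT(*)
FROM Emp_Details
GROUP BY 1;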

Hope it helps.

prakash_r02 8 posts Joined 08/10
16 Sep 2010

Karam, thanks for your answers. Can you please assist me with the following?
I have the following query:

SELECT *
FROM DB.A_Table TDR

LEFT OUTER JOIN
  (
  SEL B_Field1,
      B_Field2,
      B_Field3
  FROM DB.B_Table
  WHERE
      B_Field4 = 11005
      AND B_Field3 = '1'
  ) LOAN
ON
  TDR.A_Field1 = LOAN.B_Field1 AND
  TDR.A_Field2 = LOAN.B_Field2

LEFT OUTER JOIN
  (
  SEL C_Field1,
      C_Field2,
      C_Field3
  FROM DB.C_Table
  WHERE
      C_Field4 = 11005
      AND C_Field3 = '1'
  ) LOANH
ON
  TDR.A_Field1 = LOANH.C_Field1 AND
  TDR.A_Field2 = LOANH.C_Field2

WHERE
  TDR.A_Field4 = 11005  /* assuming A_Field4 was meant; A_Table has no B_Field4 */
  AND LOAN.B_Field3 IS NULL
  AND LOANH.C_Field3 IS NULL;
__________________________________________
Information about the tables:

DB.A_Table:
------------
Total Record Count: 1,000,000
Indexes:
  UNIQUE PRIMARY INDEX PMY_IDX (A_Field5, A_Field6, A_Field1, A_Field4)
  PARTITION BY RANGE_N on A_Field4

Number of unique values present in the fields:

A_Field5 34
A_Field6 5
A_Field1 200,000
A_Field4 20
A_Field2 24

DB.B_Table:
------------
Total Record Count: 800,000,000
Indexes:
  UNIQUE PRIMARY INDEX PMY_IDX (B_Field1, B_Field4)
  PARTITION BY RANGE_N on B_Field4

Number of unique values present in the fields:

B_Field1 30,000,000
B_Field2 1
B_Field3 3
B_Field4 128

DB.C_Table:
----------
Total Record Count: 90,000,000
Indexes:
  UNIQUE PRIMARY INDEX PMY_IDX (C_Field1, C_Field4)
  PARTITION BY RANGE_N on C_Field4

Number of unique values present in the fields:

C_Field1 3,000,000
C_Field2 1
C_Field3 3
C_Field4 128
-----------------------------------------------------------------------
Please guide me on the following:

1. I need to collect stats on the above 3 tables. How do I choose the fields for collecting stats? Please recommend the list of collect stats statements that need to be run on these 3 tables.

2. On what basis do you choose the columns for collecting stats?

Thanks in advance.

robpaller 159 posts Joined 05/09
16 Sep 2010

Stats generally should be collected on all indexes, join columns, partitioning columns, the PARTITION column for tables with multi-level partitioning, and columns that exist in the WHERE clause as selection criteria. Carrie Ballinger has a series of blog posts on the Dev Exchanges relating to statistics with some valuable tips found in the comments. (Like collecting sample stats on a unique primary index instead of full stats.)
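Applied to the A_Table posted above, that guidance would look roughly like this (a sketch using the classic syntax; not a definitive list, and the same pattern applies to B_Table and C_Table):

-- indexes and partitioning
COLLECT STATISTICS ON DB.A_Table INDEX (A_Field5, A_Field6, A_Field1, A_Field4);
COLLECT STATISTICS ON DB.A_Table COLUMN (A_Field4);
COLLECT STATISTICS ON DB.A_Table COLUMN (PARTITION);

-- join columns used against B_Table and C_Table
COLLECT STATISTICS ON DB.A_Table COLUMN (A_Field1);
COLLECT STATISTICS ON DB.A_Table COLUMN (A_Field2);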

Stats should be recollected when 10% of the data has changed for the column(s) with collected stats or the stats have aged say 90-180 days. (Your mileage may vary)

You can submit DIAGNOSTIC HELPSTATS ON FOR SESSION, and then subsequent EXPLAIN results will include suggested stats to collect that do not exist and that *MAY OR MAY NOT* improve the query plan. Collecting stats to satisfy the plan of one query can affect the plans of other queries and should be regression tested.
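Usage looks like this (the SELECT is just a stand-in for your own query):

DIAGNOSTIC HELPSTATS ON FOR SESSION;

EXPLAIN SELECT * FROM DB.A_Table WHERE A_Field4 = 11005;

-- The EXPLAIN text then ends with a list of suggested COLLECT STATISTICS
-- statements that are not currently collected.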

Find Carrie's discussion regarding multi-column statistics and the impact of uniqueness within the first 16 bytes on the histogram. This does not affect the number of unique values but does impact other values captured in the histogram.

Hope this helps.

urskumar 6 posts Joined 08/07
12 Dec 2010

Hi,

I don't have collected statistics on my table or on any of its columns. For my SQL query, the explain plan shows 'NO CONFIDENCE'.
Is there any possibility of improving the query performance without collecting statistics?
Please advise.

kumar

gryback 271 posts Joined 12/08
13 Dec 2010

Worth mentioning: collecting statistics is a key component of TASM/workload management. Stats collection results in better Optimizer estimates, which TASM then uses for pre-processing decisions and classification. For instance, putting a group's longer-running queries in their own workload definition (WD), or maybe a group's queries with estimated row returns of more than 1M rows. All these pre-processing decisions depend on Optimizer estimates.

sanjeev_1995 3 posts Joined 04/11
25 Apr 2011

Hi,

Please help with the query posted below.

I have a table called SYS_PUR with the following definition:

CREATE MULTISET TABLE SYS_PUR ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
FOLIO_KEY INTEGER TITLE 'FOLIO_KEY' NOT NULL,
SCHEME_KEY INTEGER TITLE 'SCHEME_KEY' NOT NULL,
TIME_KEY INTEGER TITLE 'TIME_KEY' NOT NULL,
TRANSACTION_NO DECIMAL(16,0) TITLE 'TRANSACTION_NO' NOT NULL,
TRANSACTION_TYPE VARCHAR(7) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'TRANSACTION_TYPE' NOT NULL,
FROM_DATE DATE FORMAT 'YY/MM/DD' TITLE 'FROM_DATE' NOT NULL,
TO_DATE DATE FORMAT 'YY/MM/DD' TITLE 'TO_DATE' NOT NULL,
SUB_BROKER_KEY INTEGER TITLE 'SUB_BROKER_KEY' NOT NULL,
BRANCH_KEY INTEGER TITLE 'BRANCH_KEY' NOT NULL,
APPLNO VARCHAR(15) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'APPLNO',
ENTRY_DATE DATE FORMAT 'YY/MM/DD' TITLE 'ENTRY_DATE',
CEASED_DATE DATE FORMAT 'YY/MM/DD' TITLE 'CEASED_DATE',
AMOUNT DECIMAL(25,8) TITLE 'AMOUNT',
ACTFLAG CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'ACTFLAG',
FREQUENCY VARCHAR(5) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'FREQUENCY',
PAYMECH VARCHAR(7) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'PAYMECH',
SIPDATE VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'SIPDATE',
BANK VARCHAR(40) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'BANK',
ACTYPE VARCHAR(5) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'ACTYPE',
INSTRMNO VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'INSTRMNO',
ACNAME VARCHAR(70) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'ACNAME',
MICRNO VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'MICRNO',
BROKCODE VARCHAR(21) CHARACTER SET LATIN NOT CASESPECIFIC,
PURCHASE_REMARKS VARCHAR(31) CHARACTER SET LATIN NOT CASESPECIFIC,
BATCH_NUM INTEGER)
PRIMARY INDEX ( FOLIO_KEY ,SCHEME_KEY ,TIME_KEY ,TRANSACTION_NO ,
TRANSACTION_TYPE ,FROM_DATE ,TO_DATE ,SUB_BROKER_KEY ,BRANCH_KEY ) ;

I have another table called TIME_DIM with the following definition:

CREATE SET TABLE TIME_DIM ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
TIME_KEY INTEGER TITLE 'TIME_KEY' NOT NULL,
CALENDAR_DATE DATE FORMAT 'YY/MM/DD' TITLE 'CALENDAR_DATE' NOT NULL,
CALENDAR_WEEK_N INTEGER TITLE 'CALENDAR_WEEK_N' NOT NULL,
CALENDAR_WEEK_W VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'CALENDAR_WEEK_W' NOT NULL,
CALENDAR_MONTH_N INTEGER TITLE 'CALENDAR_MONTH_N' NOT NULL,
CALENDAR_MONTH_W VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'CALENDAR_MONTH_W' NOT NULL,
CALENDAR_QUARTER_N INTEGER TITLE 'CALENDAR_QUARTER_N' NOT NULL,
CALENDAR_QUARTER_W VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'CALENDAR_QUARTER_W' NOT NULL,
CALENDAR_YEAR INTEGER TITLE 'CALENDAR_YEAR' NOT NULL,
FINANCIAL_WEEK_N INTEGER TITLE 'FINANCIAL_WEEK_N' NOT NULL,
FINANCIAL_WEEK_W VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'FINANCIAL_WEEK_W' NOT NULL,
FINANCIAL_MONTH_N INTEGER TITLE 'FINANCIAL_MONTH_N' NOT NULL,
FINANCIAL_QUARTER_N INTEGER TITLE 'FINANCIAL_QUARTER_N' NOT NULL,
FINANCIAL_QUARTER_W VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'FINANCIAL_QUARTER_W' NOT NULL,
FINANCIAL_YEAR VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC TITLE 'FINANCIAL_YEAR' NOT NULL,
DATE_OFFSET INTEGER TITLE 'DATE_OFFSET',
WEEK_OFFSET INTEGER TITLE 'WEEK_OFFSET',
MONTH_OFFSET INTEGER TITLE 'MONTH_OFFSET',
QUARTER_OFFSET INTEGER TITLE 'QUARTER_OFFSET',
YEAR_OFFSET INTEGER TITLE 'YEAR_OFFSET',
FINANCIAL_YEAR_OFFSET INTEGER TITLE 'FINANCIAL_YEAR_OFFSET',
BATCH_NUM INTEGER)
UNIQUE PRIMARY INDEX ( TIME_KEY )
UNIQUE INDEX TIME_DIM_IDX ( CALENDAR_DATE );

I am trying to update columns of the SYS_PUR table using non-primary-key joins as follows:

update SYS_PUR
from
(
select table1.entry_date entry_date,time_dim.time_key time_key from SYS_PUR table1,
(select * from time_dim) time_dim
where time_dim.calendar_date = table1.entry_date
) table2,
(
select table1.ceased_date cease_date,time_dim.time_key time_key from SYS_PUR table1,
(select * from time_dim) time_dim
where time_dim.calendar_date = table1.ceased_date
) table3
set cease_time_key = table3.time_key,
entry_time_key = table2.time_key
where
SYS_PUR.entry_date = table2.entry_date and
SYS_PUR.ceased_date = table3.cease_date

The explain plan says the above query is going to take some 1,000 hours to finish. Please advise if there is any workaround to do the same in less time.

Regards,
Sanjeev

chillerm 15 posts Joined 04/11
27 Apr 2011

Hey Sanjeev.

You shouldn't alias your table as the same table name. This will force a product join, which will give you that high estimated completion time.
Try:

update SYS_PUR
from
(
select table1.entry_date entry_date, td.time_key time_key from SYS_PUR table1,
(select * from time_dim) td
where td.calendar_date = table1.entry_date
) table2,
(
select table1.ceased_date cease_date,td.time_key time_key from SYS_PUR table1,
(select * from time_dim) td
where td.calendar_date = table1.ceased_date
) table3
set cease_time_key = table3.time_key,
entry_time_key = table2.time_key
where
SYS_PUR.entry_date = table2.entry_date and
SYS_PUR.ceased_date = table3.cease_date

dnoeth 4628 posts Joined 11/04
27 Apr 2011

Hi Sanjeev,
your update query doesn't match the table definitions; there are no cease_time_key and entry_time_key columns.

Your update will not work, as there will be a huge intermediate spool because there are no 1-to-N joins, which will finally lead to an error 7547: Target row updated by multiple source rows.

This is untested, maybe it's working:
update SYS_PUR
from time_dim as dim1, time_dim as dim2
set entry_time_key = dim1.time_key,
cease_time_key = dim2.time_key
where
SYS_PUR.entry_date = dim1.calendar_date and
SYS_PUR.ceased_date = dim2.calendar_date

But for me the real question is:
Why do you add an artificial time_key as PK to TIME_DIM instead of joining on calendar_date?
There's no time portion in TIME_DIM; it's a simple calendar table with a single row per date.

Dieter

sanjeev_1995 3 posts Joined 04/11
28 Apr 2011

Thanks all for your responses. I have tried what you suggested, chillerm; giving a different alias name doesn't help, since the explain plan gave the same huge number of hours as earlier.

I have already implemented what you suggested, dnoeth, and this way the query performance improves a lot. Adding an artificial time key may give a faster join, since an integer join is cheaper than a date join. I agree it may not be a huge performance advantage, but still, in all our OLAP tables dates are replaced by keys.

dnoeth 4628 posts Joined 11/04
28 Apr 2011

You probably haven't worked with Teradata before.
Of course there's an overhead for joining on multiple columns or large VARCHARs, but it's low compared to Oracle et al.
In Teradata you don't have to join on a single numeric column to be fast enough :-)

A Teradata DATE is stored as a 4-byte integer, exactly like your TIME_KEY, so why should there be any difference?
In fact, as your table grows due to the additional column, everything will be a bit slower than before.

Dieter

ratnamch 14 posts Joined 06/12
21 Oct 2012

 
Hi,

What are data demographics?
How many types of demographics are there?
How do they work in Teradata?

Could you please explain briefly, step by step?

Thanks,
ratnam

ratnamch 14 posts Joined 06/12
28 Oct 2012

I have a table with one column:

ename
aravindh
bargav
chendu
dsnunjay
eswar
ferooj
ganesh
hari
iliaj
jersy
kalyan

I want to retrieve the rows whose ename starts with a letter from 'a' to 'f'. How do I find the enames whose first character is in the range 'a' to 'f' and retrieve that data?
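One way to do this (a sketch, untested; it assumes the table is named Emp, which the post doesn't say, and that the values are lowercase as shown):

SELECT *
FROM Emp
WHERE SUBSTR(ename, 1, 1) BETWEEN 'a' AND 'f';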

sieger007 5 posts Joined 09/15
16 Sep 2015

I have a column whose access count ranks about #9 among the 150 columns in a table (it is widely accessed), but it has just one distinct value ('y').
Does it help to collect stats on it even with so little variation? When does it help, and when is it misleading to the optimizer?
What else could one do to optimize joins involving this column?

ToddAWalter 316 posts Joined 10/11
17 Sep 2015

If it is frequently accessed via a condition (as opposed to just appearing in the select result list) then it is important to collect stats. The optimizer will assume any column without stats has a median distribution. If your data is much more unique or, as in this case, much less unique then you should collect stats to let the optimizer know this.
 
If the data is going to stay this way, always just one value, then it should not be necessary to re-collect the stats; just collect them once. But remember to collect again if the column is populated with other values.
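In other words, a one-time collection along these lines (hypothetical names):

COLLECT STATISTICS ON MyDB.MyTable COLUMN (Flag_Col);

-- Re-collect only if Flag_Col ever starts holding values other than 'y'.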

sieger007 5 posts Joined 09/15
24 Sep 2015

Thank you. That makes a LOT of sense. The obvious was staring me in the face, yet I couldn't connect the dots.
I had another post as well, with no hits on it so far. When you have a minute, Todd:
http://forums.teradata.com/forum/database/how-to-optimize-queries-that-have-aggregation-involving-multiple-fact-tables#comment-143852

sieger007 5 posts Joined 09/15
25 Sep 2015

Another aspect of collect stats: combination stats. A lot of the time the optimizer suggests many combination stats. The ones that involve WHERE-clause columns (and frequently so) will benefit from combination stats, but typically I ignore select-list columns even when the optimizer suggests them. I wonder whether there are conditions where stats on select-list columns (solitary or in combination, but especially the latter) could make a big difference to CPU before and after stats.
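For reference, a combination (multi-column) stat is collected like this (hypothetical names):

COLLECT STATISTICS ON MyDB.Sales COLUMN (Region_Cd, Product_Cd);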
