quest 20 posts Joined 12/09
22 Feb 2013
How to validate whether my PI selection is right?

I have defined a PI for a table. Now I need to validate whether the columns selected for the PI are the right candidates. What are the ways to go about it?
How can I show whether all the AMPs are being used optimally?
Which tools should I use to focus on the performance aspects?

Wasif.Kabeer 12 posts Joined 11/11
22 Feb 2013


You may check the skew factor of the PI column by using the query below. The lower the skew factor, the better the distribution.

Get skew factor of a single column:

SEL (100 - (AVG(total)/MAX(total)*100)) AS skewfactor
FROM (
    SELECT HASHAMP(HASHBUCKET(HASHROW(PI_Column))) AS amp_no,
           COUNT(*) AS total
    FROM DatabaseName.TableName
    GROUP BY 1
) tab1;

(Replace PI_Column with the column you are evaluating. The HASHROW/HASHBUCKET/HASHAMP wrapping maps each row to the AMP it would hash to, so the derived table gives you a row count per AMP.)
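If the table is already loaded, you can also measure skew from actual per-AMP disk usage instead of row counts. A sketch against the DBC.TableSizeV system view; the database and table names in the WHERE clause are placeholders for your own objects:

```sql
-- Skew check based on CurrentPerm (bytes of permanent space per AMP/vproc).
-- A perfectly even table has AVG(CurrentPerm) = MAX(CurrentPerm), i.e. skew 0.
SELECT DatabaseName,
       TableName,
       (100 - (AVG(CurrentPerm) / MAX(CurrentPerm) * 100)) AS SkewFactor
FROM DBC.TableSizeV
WHERE DatabaseName = 'YourDatabase'     -- placeholder
  AND TableName    = 'YourTable'        -- placeholder
GROUP BY DatabaseName, TableName;
```

This has the advantage of reflecting row size differences as well as row counts, and it doesn't require scanning the table itself.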




VandeBergB 182 posts Joined 09/06
23 Feb 2013

The thing about the Primary Index is that it ISN'T all about data distribution.  If you select a PI that perfectly distributes the data across the AMPs but is never used as a join column or in a query predicate, you're doing yourself and your database a disservice.
A key priority of the PI is that it should define the most common access path; some data skew is perfectly acceptable.  There are probably as many definitions of acceptable skew as there are people who can spell "skew", so don't look for one "perfect" answer. Personally, I try to keep data skew below 20-25.
When you define the same PI on two commonly joined tables, you enable your TD system to do AMP-local joins, with no redistribution.  You can see this in the explain plan.
Remember, rows have to be on the same AMP to be joined.  If you have two tables with different PIs, the optimizer will pick the smaller of the two tables (if it can, and assuming you have current statistics collected), do a FTS on the smaller table, bring it into spool, re-hash the spooled rows on the join column (the PI of the larger table), push them back down to the AMPs according to the re-hashed column, and THEN join your rows.
If you can define the same PI on two frequently joined tables and give up some skew, you can eliminate the FTS, re-hash and redistribution.
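A quick way to check for the redistribution described above is to EXPLAIN the join and read the spool steps. A sketch, with illustrative table and column names:

```sql
-- EXPLAIN shows the optimizer's plan without executing the query.
-- With matching PIs on the join column you should see a direct,
-- AMP-local merge join; with mismatched PIs, look for a spool step
-- worded along the lines of "redistributed by the hash code of ..."
-- (or "duplicated on all AMPs" for small tables).
EXPLAIN
SELECT h.SalesOrderHeaderID, l.LineItemID
FROM SalesOrderHeader h
JOIN SalesOrderLineItem l
  ON h.SalesOrderHeaderID = l.SalesOrderHeaderID;
```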
Consider the Sales Order Header and Sales Order Line Item tables, a fairly common occurrence.  The natural identifier for the Header table, and by default the initial PI selection, is something like SalesOrderHeaderID.  The Line Item table also has a natural identifier, but in this case it happens to be a composite key, composed of SalesOrderHeaderID (to relate each line back to its parent header row) and LineItemID (to differentiate each line item within the order).
If you select SalesOrderHeaderID, LineItemID as a composite PI on the Line Item table, you've now hashed ALL of your line item rows to different AMPs than their header rows: perfectly distributed, but NOT co-located.  The query that joins headers and line items now has to do a full table scan of the line item table, rehash it on SalesOrderHeaderID, and then join it. All you've achieved is data distribution, at the cost of more complex processing.
If you accept some skew and define the PI of the LineItem table as SalesOrderHeaderID alone, the child rows will hash to the same AMPs as their parent rows; run the same query and you've now got AMP-local operations...
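The trade-off above can be sketched in DDL. Table and column names follow the example and are only illustrative:

```sql
-- Parent: the natural single-column PI.
CREATE TABLE SalesOrderHeader (
    SalesOrderHeaderID INTEGER NOT NULL,
    OrderDate          DATE
) PRIMARY INDEX (SalesOrderHeaderID);

-- Child: a NUPI on the parent key alone, so every line item hashes to
-- the same AMP as its header row (co-located, enabling AMP-local joins),
-- at the cost of some skew for orders with many line items.
CREATE TABLE SalesOrderLineItem (
    SalesOrderHeaderID INTEGER NOT NULL,
    LineItemID         INTEGER NOT NULL,
    Quantity           INTEGER
) PRIMARY INDEX (SalesOrderHeaderID);
```

Defining PRIMARY INDEX (SalesOrderHeaderID, LineItemID) on the child instead would give near-perfect distribution but force the rehash/redistribution described above on every header-to-line-item join.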

Some drink from the fountain of knowledge, others just gargle.

quest 20 posts Joined 12/09
23 Feb 2013

Thanks, Blaine and Wasif, for the wonderful replies. Blaine, your reply was surely very insightful and will help me a lot.

quest 20 posts Joined 12/09
24 Feb 2013

Continuing on the above..
We have the concept of foreign keys (either soft RI or hard RI). It is often recommended to go for soft RI, since it won't hard-stop any insertion of records and is still used by the query optimizer.
How can one benefit from using RI (as used by the optimizer) compared to the PI definition? Can defining RI be a substitute for opting for a composite-key PI?
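For reference, soft RI in Teradata is declared with REFERENCES WITH NO CHECK OPTION. A sketch on the earlier example tables (constraint and column names illustrative):

```sql
-- Soft RI: the constraint is NOT enforced on insert/update/delete,
-- but the optimizer can still rely on it, e.g. for join elimination
-- when a query touches only child-table columns.
ALTER TABLE SalesOrderLineItem
ADD CONSTRAINT fk_order_header
FOREIGN KEY (SalesOrderHeaderID)
REFERENCES WITH NO CHECK OPTION SalesOrderHeader (SalesOrderHeaderID);
```

Note that soft RI and the PI serve different purposes: the PI governs row placement (distribution and co-location for joins), while RI only informs the optimizer about the parent-child relationship. One is not a substitute for the other.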
