kelvsdotph 14 posts Joined 06/13
16 Sep 2013
SQL Workaround

Hi Teradata forumers,
 
Can you guys help me figure out an SQL workaround? Here's my problem. For my query tuning work, I'm joining a fairly big table to a small one. The column being joined on the BIG table is only a NUSI, while the smaller table uses it as a NUPI. As far as I know, a NUSI doesn't drive data distribution, so I proposed changing the structure of the BIG table to use the column as its PI, but I was advised not to, since the table is huge and the change would impact a lot of users. Replicating the big table isn't an option either, since spool and temp space are limited. A JI could help, but inserts can get slower and might hamper the overall performance/timing. Any ideas, guys? What could I use in order to get a good join condition?
 
Some info: 
 
I'm bound by an NDA, so I can't disclose much info publicly.
 
The TD servers I'm working on are version 12-13.
 

dnoeth 4628 posts Joined 11/04
16 Sep 2013

Define "fairly big" and "fairly small".
A small table will probably be "duplicated to all AMPs" and then joined. Did you check in DBQL whether this step is actually consuming too many resources?
 
Dieter


kelvsdotph 14 posts Joined 06/13
16 Sep 2013

Hi dieter,
 
The BIG table has 500m records and the small one has 48m. The DBQL shows high CPU impact, skewness and spool usage.
BTW, I was told that our system is sitting on 680 AMPs.
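For reference, the kind of DBQL query that surfaces these numbers looks roughly like the one below - column names are from the standard DBQL log table, so adjust them to your release and views:
 
SEL QueryID,
    AMPCPUTime,
    SpoolUsage,
    TotalIOCount,
    /* "impact CPU": as if every AMP burned as much CPU as the busiest one */
    MaxAMPCPUTime * (HASHAMP() + 1) AS ImpactCPU
FROM DBC.QryLog
WHERE <filter on user/date>;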

Raja_KT 1246 posts Joined 07/09
16 Sep 2013

Hi,
Just my small cents :)
Yes, I do agree a JI will hamper the insert performance; still, a single-table JI could help a little.
How about partitioning the huge table? It is just my thought.
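Just to illustrate my point, a single-table JI hashed on the join column could look something like this - the table and column names are only placeholders:
 
/* single-table join index hashed on the join column, so the optimizer
   can use it for an AMP-local join instead of redistributing the big table */
CREATE JOIN INDEX ji_bigtab_joincol AS
SELECT join_col, needed_col1, needed_col2
FROM big_table
PRIMARY INDEX (join_col);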
Thanks and regards,
Raja

Raja K Thaw
My wiki: http://en.wikipedia.org/wiki/User:Kt_raj1
Street Children suffer not by their fault. We can help them if we want.

kelvsdotph 14 posts Joined 06/13
17 Sep 2013

Raja,
Thanks, I'll try that out.  
 
For everyone's record, what I did was a query rewrite. I found some columns that could be used to link the tables together. Hopefully I can get a smiling face from my client tomorrow. Will keep you posted.
 
kelvs

dnoeth 4628 posts Joined 11/04
17 Sep 2013

Ok, 48m is not really "fairly small" :-)
Are the tables partitioned? Do you actually join all rows or are there some common filtering conditions? How often are those tables accessed? What other tables are joined to them?
All this (and more) must be evaluated, and changing the PI should always remain a possible outcome.
Without further details it's hard to assist.
 
Dieter


dnoeth 4628 posts Joined 11/04
17 Sep 2013

A query rewrite where you change the join-columns will usually change the result set :-)
Can you elaborate on this?
 
Dieter


kelvsdotph 14 posts Joined 06/13
17 Sep 2013

Yes, Dieter, that's what I need to test now - to see whether the rewrite changed the result sets. I did find out that there are joins on NUPI columns where the distinct count is around 30% less than the total record count, so I intend to use a subquery to get the distinct values of that column first and join on it, hoping that lowers the processing. I kept the join columns the same as in the original query except for this one, but then again, I'm crossing my fingers right now.
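Roughly, the shape of the rewrite I have in mind is something like this (table and column names are placeholders, not the real ones):
 
SEL ...
FROM TAB_BIG B
INNER JOIN
( /* pre-distinct the NUPI column; this is only equivalent to the original
     if the duplicate rows were not needed in the result */
  SEL DISTINCT nupi_col
  FROM TAB_SMALL
) S
  ON B.nupi_col = S.nupi_col;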

dnoeth 4628 posts Joined 11/04
17 Sep 2013

There's a specific case I know of where it's not changing the result:
e.g. orders and lineitem, both with PI(ordernumber) PARTITION BY orderdate; the join based on the FK would be on (ordernumber) alone, but that results in a very inefficient join, because it's not known in which partition a row is located. Adding orderdate to the join switches back to a direct merge join. It's hard to ensure that end users know and apply that, so there's usually a view pre-joining the tables :-)
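A minimal sketch of such a view, assuming both tables carry the orderdate they are partitioned on (column names are just illustrative):
 
/* both tables: PRIMARY INDEX (ordernumber) PARTITION BY orderdate;
   adding orderdate to the join allows a direct, partition-wise merge join */
REPLACE VIEW v_order_item AS
SELECT o.ordernumber, o.orderdate, l.linenumber, l.quantity
FROM orders o
JOIN lineitem l
  ON o.ordernumber = l.ordernumber
 AND o.orderdate   = l.orderdate;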
 
Dieter


kelvsdotph 14 posts Joined 06/13
19 Sep 2013

Ah yes, I think I saw things like that on my previous project, Dieter.
Anyway, an update for everyone. I was advised by my partner to just recommend a structural change, since modifying the query would only be more taxing. The performance gain I got from the above was negligible - either I'm not a good query writer or the design isn't applicable anymore.
Thanks to everyone who helped :)

kelvsdotph 14 posts Joined 06/13
23 Jan 2014

hi TD forumers,
I'm working on a query that goes like this (due to some legal restrictions I'm not permitted to copy code off the premises, so I'll do my best to write pseudo-code):
SEL A, B, C
FROM
( SEL T1.A, T1.B, T1.C
  FROM TAB1 T1
  LEFT JOIN TAB2 T2
    ON T1.A = T2.A
   AND T1.B = T2.B ) Q1
 
LEFT JOIN TAB3 T3
  ON Q1.A = T3.A
 AND Q1.B = T3.B
 
WHERE <blah blah blah>
 
The query above seems fine; however, columns A and B of T1 are very skewed (some combinations, when grouped by both fields, yield 2 to 130+ records), whereas in T2 and T3 the fields A and B are unique per record. Now here's my question: how do you go about rewriting a query when the joining columns of the driving/main table (T1 in this query) are very skewed? The only solution I could think of was reducing the records of the main table, but that didn't really address the skew. Secondly, I also tried creating a volatile table on Q1, deliberately using columns A and B as the PI, but the ImpactCPU (from DBQL) still seems high. Any ideas that could help, guys?
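For reference, the volatile-table attempt looked roughly like this (names simplified):
 
/* materialize Q1 with an explicit PI on the join columns; the skewed
   (A, B) values still hash to the same AMPs, so the skew itself remains */
CREATE VOLATILE TABLE VT_Q1 AS
( SEL T1.A, T1.B, T1.C
  FROM TAB1 T1
  LEFT JOIN TAB2 T2
    ON T1.A = T2.A
   AND T1.B = T2.B
) WITH DATA
PRIMARY INDEX (A, B)
ON COMMIT PRESERVE ROWS;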
Kelvin 

Raja_KT 1246 posts Joined 07/09
23 Jan 2014

You have not shared some more info, like data volume, the Explain output, number of AMPs, etc. You can mask any business-critical stuff :).
Are they the same tables you mentioned above or different ones? Any partitioning?
 
My thought:
I feel that since the disease is the skewness of T1, you need to cure that one first, in or outside this query. I'm not able to understand your point "but the skewness wasn't addressed well". How about a volatile table for T1 alone, with the intent of reducing skewness, if it suits the end requirement?

Raja K Thaw
My wiki: http://en.wikipedia.org/wiki/User:Kt_raj1
Street Children suffer not by their fault. We can help them if we want.

dnoeth 4628 posts Joined 11/04
24 Jan 2014

Hi Kelvin,
130+ records is not skewed - unless the "+" includes at least 5 to 6 digit numbers :-)
What's the PI of those tables and number of rows?
Which table is redistributed based on Explain? 
Could you show estimated and actual numbers from QryLogSteps?
Why do you use a derived table for the first join, when both joins go against T1?
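Something along these lines should pull the step numbers, assuming standard DBQL step logging (column names may differ slightly by release):
 
SELECT StepLev1Num,
       StepName,
       EstRowCount,   /* optimizer's estimate */
       RowCount,      /* actual rows produced by the step */
       CPUTime,
       IOCount
FROM DBC.QryLogSteps
WHERE ProcID  = <procid>
  AND QueryID = <queryid>
ORDER BY StepLev1Num;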

Dieter

kelvsdotph 14 posts Joined 06/13
27 Jan 2014

Raja, Dieter,
Thank you for responding. I'll try to explain as much as I can.
for Raja:
> data volume, explain, amps etc if any.
OK, data volumes vary. I have one query with the same scenario (indicated above) where TAB1 holds 18m records and TAB2 10m; as to which table gets redistributed where, I can't remember exactly.
for Dieter:
I've tried grouping certain fields, say columns A and B, with a COUNT(*) to figure out how many repetitions there are for one particular combination.
To visualize the contents of TAB1
SEL A, B, COUNT(*) FROM TAB1 GROUP BY 1,2 HAVING COUNT(*) > 1;
A | B | COUNT(*)
------------------------------------
AAA | BBB | 130
CCC | BBB | 2
DDD | XXY | 67
TAB1: columns A and B are being used for the joins.
The same query, when run against TAB2 and TAB3, yields no duplicate records in the joining columns. I'm new to tuning, but I know that with skewed data your fastest AMP ends up waiting for your slowest AMP, so skewness should be avoided at all times (well, at least that's what the manual says :) ).
PI info:
TAB1's PI is not on columns A and B; TAB2 & TAB3 use A and B as NoPI.
Could you show estimated and actual numbers from QryLogSteps? - I'll try to get the info here.
Why do you use a derived table for the first join? - To be honest, I don't even know. I just saw this on the list of queries to tune and picked it up. There are lots of queries with this kind of problem. If I get lucky, some of the queries have TAB2 columns which are skewed; I've done some grouping on its data and it seemed to work well for joining. The problem is IF the columns of the driving table are very skewed - then yeah, I've got to know how to attack this sort of problem.
I know my info is sparse, but what I'm interested in here is the "style": how can a query rewrite address the skewness of columns A and B in TAB1 (the driving table)? I've discussed this with my mates here and they told me to look for a few more columns in TAB1 that are also present in TAB2/3 and can be used for joining, thereby spreading out the duplicates further. I'm just not sold on that, because it's prone to changing the result set, as Dieter mentioned above.
K.

Adeel Chaudhry 773 posts Joined 04/08
02 Feb 2014

Sounds like a very nice project.
 
I would start by analyzing each query and noting the underlying tables, data volumes, PIs, PPIs, joins, joining columns, data types of the joining columns, etc.
 
Based on this info I would go from harder to simpler cases, as the simple cases may not require much optimization and won't have much room for improvement.

-- If you are stuck at something .... consider it an opportunity to think anew.

dnoeth 4628 posts Joined 11/04
02 Feb 2014

@kelvsdotph:
I would go with a PI on (A,B) for all three tables; this will result in the fastest possible local merge join. If you're actually concerned about skew you can do a

SELECT HASHROW(A,B), COUNT(*)
FROM TAB1
GROUP BY 1
HAVING COUNT(*) > 100
ORDER BY 2 DESC

This returns the actual PI duplicates, including hash synonyms. If the count is less than a few thousand you don't have to worry; just switch to a MULTISET table to avoid duplicate row checks when the count is more than a few hundred.
The rule of thumb is: don't worry as long as all rows per value fit into a single block (64000 / avg row size); partitioning also helps reduce the rows per block.
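A minimal sketch of that kind of table, with made-up column definitions:
 
/* MULTISET skips the duplicate-row check on INSERT;
   PI (A,B) keeps the joins on A and B AMP-local */
CREATE MULTISET TABLE tab1_new
( A INTEGER
, B INTEGER
, C INTEGER
)
PRIMARY INDEX (A, B);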

Dieter

kelvsdotph 14 posts Joined 06/13
07 Feb 2014

Guys,
Apologies for the late reply.
I discussed this with my colleagues and here's what we came up with. The problem is that the joining columns are skewed, so here's what we did:
1. Create a subquery of the driving table containing the columns referenced in the join - in our case, columns A & B.
2. Add around 2 more columns to the subquery from step 1. You will have to pick columns that give you a lower skew factor (better distribution). Group the records:
SEL A, B, C, D 
FROM TAB1 
GROUP BY 1,2,3,4
3. Join this subquery to the table that needs to be joined to - in our case above, TAB2:
SEL X.A, X.B, X.C, X.D
FROM
( SEL A, B, C, D
  FROM TAB1
  GROUP BY 1,2,3,4
) X
INNER JOIN TAB2
   ON X.A = TAB2.A
  AND X.B = TAB2.B
 

The join will be fine since the records in the subquery are spread out (that's how I understood it). Then, using this whole query, join back to the driving table on the columns in the select list. So instead of having only two join columns (the ones originally skewed in the driving table), the added columns provide extra distribution, and now we have four columns to join on:

 

SEL .. <some columns>
FROM TAB1
INNER JOIN
( subquery above
) SUB
   ON SUB.A = TAB1.A
  AND SUB.B = TAB1.B
  AND SUB.C = TAB1.C
  AND SUB.D = TAB1.D

 

The idea here is to work around the skewness. The ImpactCPU went down; however, the TotalIOCount went up. It may be a crude way of solving the skewness of a driving table, but it was very clever of my colleague to think of it.
 
Hope this helps the TD community.
K.
 
P.S. Dieter, I'll try to read up on what you said here:
"The rule of thumb is: don't worry as long as all rows per value fit into a single block (64000 / avg row size); partitioning also helps reduce the rows per block."
I'm very new to tuning and only know a few workarounds, but thanks for it :)
 
Cheers,
K.

Adeel Chaudhry 773 posts Joined 04/08
07 Feb 2014

K, it's good practice to share a solution. Thanks.

-- If you are stuck at something .... consider it an opportunity to think anew.
