skunkwerk 17 posts Joined 08/15
30 Mar 2016
help optimizing GROUP BY query

This is my query:

CREATE TABLE rtl.intermediate AS (
SELECT
   customer_id,
   MAX(new_to) AS new_to, 
   MIN(age) AS age,
   MIN(gender) AS gender,
   MIN(existing) AS existing
FROM rtl.base
WHERE
   country='China'
AND
   product='cereal'
AND
   dt BETWEEN '2015-01-01' AND '2016-01-01'
GROUP BY customer_id
) WITH DATA 
UNIQUE PRIMARY INDEX (customer_id, new_to, gender);

It currently takes about 10 seconds to run, and I would like to bring it down to 2 seconds.  The rtl.base table is partitioned on the date column dt (7-day ranges) and has a primary index on (customer_id, product, country, dt).  I have collected statistics on the PARTITION column and on the age column.
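
The stats were collected roughly like this (assuming the newer COLLECT STATISTICS syntax; the exact statements may differ):

COLLECT STATISTICS COLUMN (PARTITION) ON rtl.base;  -- partition stats
COLLECT STATISTICS COLUMN (age)       ON rtl.base;  -- column stats on age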
 
This is the explain:

1) First, we lock a distinct rtl."pseudo table" for read on a
RowHash to prevent global deadlock for
rtl.base.
2) Next, we lock rtl.intermediate for
exclusive use, and we lock rtl.base for read.
3) We lock a distinct DBC."pseudo table" for read on a RowHash for
deadlock prevention.
4) We lock DBC.DBase for read on a RowHash.
5) We do a single-AMP ABORT test from DBC.DBase by way of the unique
primary index "Field_1 = 'rtl'" with a residual condition of (
"'0000BF0A'XB= DBC.DBase.Field_2").
6) We create the table header.
7) We do an all-AMPs SUM step to aggregate from 53 partitions of
rtl.base with a condition of (
"(rtl.base.dt >= DATE '2015-01-01') AND
((rtl.base.dt <= DATE '2016-01-01') AND
((rtl.base.country = 'CHN') AND
(rtl.base.product = 'cereal')))")
, grouping by field1 ( rtl.base.customer_id).
Aggregate Intermediate Results are computed globally, then placed
in Spool 3. The size of Spool 3 is estimated with no confidence
to be 8,142,324 rows (293,123,664 bytes). The estimated time for
this step is 0.28 seconds.
8) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan into Spool 1 (all_amps), which is redistributed
by the hash code of (rtl.base.customer_id,
rtl.base.new_to,
rtl.base.gender) to all AMPs. Then we do a
SORT to order Spool 1 by row hash. The size of Spool 1 is
estimated with no confidence to be 8,142,324 rows (227,985,072
bytes). The estimated time for this step is 0.15 seconds.
9) We do an all-AMPs MERGE into
rtl.intermediate from Spool 1 (Last Use).
The size is estimated with no confidence to be 8,142,324 rows.
The estimated time for this step is 1 minute and 27 seconds.
10) We lock a distinct DBC."pseudo table" for write on a RowHash for
deadlock prevention, we lock a distinct DBC."pseudo table" for
write on a RowHash for deadlock prevention, and we lock a distinct
DBC."pseudo table" for write on a RowHash for deadlock prevention.
11) We lock DBC.Indexes for write on a RowHash, we lock DBC.TVFields
for write on a RowHash, we lock DBC.TVM for write on a RowHash,
and we lock DBC.AccessRights for write on a RowHash.
12) We execute the following steps in parallel.
1) We do a single-AMP ABORT test from DBC.TVM by way of the
unique primary index "Field_1 = '0000BF0A'XB, Field_2 =
'INTERMEDIATE'".
2) We do an INSERT into DBC.Indexes (no lock required).
3) We do an INSERT into DBC.Indexes (no lock required).
4) We do an INSERT into DBC.Indexes (no lock required).
5) We do an INSERT into DBC.TVFields (no lock required).
6) We do an INSERT into DBC.TVFields (no lock required).
7) We do an INSERT into DBC.TVFields (no lock required).
8) We do an INSERT into DBC.TVFields (no lock required).
9) We do an INSERT into DBC.TVFields (no lock required).
10) We do an INSERT into DBC.TVM (no lock required).
11) We INSERT default rights to DBC.AccessRights for
rtl.intermediate.
13) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> No rows are returned to the user as the result of statement 1.

thanks!

dnoeth 4628 posts Joined 11/04
31 Mar 2016

Why do you create a three-column UPI if customer_id is already unique? 
Simply use UNIQUE PRIMARY INDEX (customer_id) and there's no redistribution needed. 
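
E.g., the CTAS would then be (just a sketch with the changed index, otherwise your query as posted):

CREATE TABLE rtl.intermediate AS (
SELECT
   customer_id,
   MAX(new_to) AS new_to,
   MIN(age) AS age,
   MIN(gender) AS gender,
   MIN(existing) AS existing
FROM rtl.base
WHERE
   country='China'
AND
   product='cereal'
AND
   dt BETWEEN '2015-01-01' AND '2016-01-01'
GROUP BY customer_id
) WITH DATA
UNIQUE PRIMARY INDEX (customer_id);  -- single-column UPI matching the GROUP BY, so no redistribution before the MERGE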
 
Did you create the base table?
You might try changing the PI, too. Can you post some demographics (distinct values, rows per value) for customer_id, product, country?

Dieter

skunkwerk 17 posts Joined 08/15
01 Apr 2016

thanks Dieter,
I added the secondary index including gender, as I later need to GROUP BY gender and get a COUNT(*).  I assumed that would need to be indexed, so I added it to the secondary index.  Should I not do that?
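
The later query is roughly this (simplified):

SELECT gender, COUNT(*) AS customers
FROM rtl.intermediate
GROUP BY gender;
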
Yes, I did create the base table as well.  But it already has a UNIQUE PRIMARY INDEX (customer_id, product, country) etc.  What are you recommending that I change?
 
regards,
imran

dnoeth 4628 posts Joined 11/04
01 Apr 2016

Hi imran,
Well, there's no secondary index in your table definition?
Stay away from defining UPIs on all your tables: you'll get perfectly distributed tables but horrible performance.
That's why I asked for additional info: you might be able to switch to a NUPI which matches your queries (e.g. PI(customer_id) to improve GROUP BY performance), or you can add additional partitioning.
<cut & paste>
Criteria for Choosing the Primary Index

Access = Maximize single-AMP operations
- Choose the column(s) most frequently used for JOIN (and WHERE) conditions, usually a Primary or Foreign Key

Distribution = Optimize parallel processing
- Choose the column(s) that provide good distribution
- Better when there are more unique values and fewer rows per value
- Adding more columns might result in better distribution but less usability: the RowHash can only be calculated when all values are exactly known

Volatility = Reduce maintenance resource overhead (I/O)
- Choose a column with stable data values
- An UPDATE of the PI column effectively results in: DELETE row – send row to new AMP – INSERT row (plus maintaining all Secondary/Join Indexes)

The Primary Key of a table is always a perfect fit based on distribution and volatility, as it's unique and stable.

But access is the most important criterion, and a PK might not be accessed at all.

So data distribution has to be balanced with access usage in choosing a PI.
</cut & paste>

Dieter

skunkwerk 17 posts Joined 08/15
05 Apr 2016

thanks Dieter,
I've now changed to a non-unique primary index, with a MULTISET table, and partitioning by month on the date range.
Computing the aggregates is pretty fast (~3 seconds) on about 800 million rows of data with a few filters applied.
What takes longer is finding the chronologically first row for each customer that matches the filters in the selected date range.  A customer can purchase a particular item on two separate days, so right now they are represented as two rows in the table.  For a metric like % of customers, I need only one row per customer, and would like to use each customer's first row (by date).
I've tried the standard GROUP BY subquery:

SELECT
   AVG(age) AS age,
   AVG(spend) AS spend
FROM rtl.base t1
INNER JOIN
(
SELECT
   customer_id,
   MIN(dt) AS dt
FROM
   rtl.base
WHERE
    country='China'
AND
    product='cereal'
AND
    dt BETWEEN '2015-01-01' AND '2016-01-01'
GROUP BY
   customer_id
) t2 ON (t1.customer_id = t2.customer_id AND t1.dt = t2.dt);

 
Also tried ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY date ASC)..... QUALIFY row_no = 1
But it takes quite a bit of time (< 30 seconds) on about 3 million rows.  Any suggestions on how to speed this up?

skunkwerk 17 posts Joined 08/15
06 Apr 2016

Would it help if I added a second level of partitioning on customer_id, so that each AMP would have all the rows of data it needs for the GROUP BY locally?
 
thanks

dnoeth 4628 posts Joined 11/04
06 Apr 2016

Partitioning does not change the distribution of rows, just the internal order; distribution is based only on the PI. What's your current PI?
I usually assume that OLAP functions are more efficient; do you mean this query was performing slower?

SELECT
   AVG(age) AS age,
   AVG(spend) AS spend
FROM 
 ( 
   SELECT
      age, spend
   FROM
      rtl.base
   WHERE
       country='China'
   AND
       product='cereal'
   AND
       dt BETWEEN '2015-01-01' AND '2016-01-01'
   QUALIFY
       ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY dt ASC) = 1
 ) AS t;

 

Dieter

skunkwerk 17 posts Joined 08/15
06 Apr 2016

That actually seems to have slowed the query down even further; it now runs in 60 seconds.
This is how I partitioned it:

PRIMARY INDEX (customer_id, dt, country, product)
  PARTITION BY (
    RANGE_N(dt BETWEEN DATE '2000-04-01' AND DATE '2016-04-05' EACH INTERVAL '1' MONTH),
    RANGE_N(customer_id BETWEEN 1 AND 1000000000 EACH 2000000));

 

The explain statement says it should only take 3 seconds to run.

Even after collecting stats on the base table for columns customer_id and date, the performance remains slow (44 seconds).

Any suggestions on improvement?

dnoeth 4628 posts Joined 11/04
06 Apr 2016

Statistics will hardly help; the estimates might be better, but the actual resource usage will not change.
In theory the fastest way should be:

PRIMARY INDEX (customer_id)
  PARTITION BY (
    RANGE_N(dt BETWEEN DATE '2000-04-01' AND DATE '2016-12-31' EACH INTERVAL '1' MONTH));

Depending on the actual data this might result in a skewed table. What's the average and maximum number of rows per customer_id/month?
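
You can check with something like this (just a sketch; column names taken from your DDL):

SELECT AVG(CAST(cnt AS FLOAT)) AS avg_rows,
       MAX(cnt)                AS max_rows
FROM
 (
   SELECT customer_id,
          EXTRACT(YEAR FROM dt)  AS yr,
          EXTRACT(MONTH FROM dt) AS mth,
          COUNT(*) AS cnt
   FROM rtl.base
   GROUP BY 1, 2, 3
 ) AS t;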

 

Dieter

skunkwerk 17 posts Joined 08/15
07 Apr 2016

Hi Dieter!  Is this a one-man support forum?
The average is 1, the max is 55.
 
thanks,
imran

dnoeth 4628 posts Joined 11/04
08 Apr 2016

If you only get a two-digit number of rows per PI value and partition, it should be safe to switch to:

PRIMARY INDEX (customer_id)
PARTITION BY (
    RANGE_N(dt BETWEEN DATE '2000-04-01'
                   AND DATE '2016-12-31' -- go as far into the future as needed to avoid ADD partition later
            EACH INTERVAL '1' MONTH));

No USI.
If the source data is known to be unique, a MULTISET table will speed up the Insert/Select (no duplicate-row check).
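
Put together, the DDL would look roughly like this (sketch only; columns beyond customer_id and dt are assumed):

CREATE MULTISET TABLE rtl.base
 ( customer_id INTEGER NOT NULL
 , dt          DATE    NOT NULL
   -- ... remaining columns (country, product, age, spend, ...) ...
 )
PRIMARY INDEX (customer_id)
PARTITION BY (
    RANGE_N(dt BETWEEN DATE '2000-04-01' AND DATE '2016-12-31' EACH INTERVAL '1' MONTH));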

Dieter

skunkwerk 17 posts Joined 08/15
08 Apr 2016

thanks Dieter.
Will doing it this way still allow me to filter quickly, say on country, even without a secondary index?  How does that work?
It's not unique, as a single customer may have multiple rows.
regards,
imran

hammam1983 1 post Joined 04/16
11 Apr 2016

Hi all,
 
I have a small query on some points.
 
Can anyone assist me?
regards,
Hammam
