Why do you create a three column UPI if customer_id is already unique?
Simply use UNIQUE PRIMARY INDEX (customer_id) and there's no redistribution needed.
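Something like this is all that's needed (placeholder table and column names, keep whatever columns you already have):

CREATE TABLE your_customer_table        -- placeholder name
( customer_id INTEGER NOT NULL
, some_attribute VARCHAR(100)           -- plus whatever other columns you have
)
UNIQUE PRIMARY INDEX (customer_id);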
Did you create the base table?
You might try changing the PI, too. Can you add some demographics for customer_id, product and country?
Dieter
thanks Dieter,
I added the secondary index including gender, as I later need to GROUP BY gender and get a COUNT(*). I assumed that would need to be indexed, so I added it to the secondary index. Should I not do that?
Yes, I did create the base table as well. But it already has a UNIQUE PRIMARY INDEX (customer_id, product, country) etc. What are you recommending that I change?
regards,
imran
Hi imran,
well, there's no secondary index in your table definition?
Stay away from defining UPIs on all your tables; you'll get perfectly distributed tables and horrible performance.
That's why I asked for additional info: you might be able to switch to a NUPI which matches your queries (e.g. PI(customer_id) to improve GROUP BY performance), or you can add partitioning.
<cut & paste>
Criteria for Choosing the Primary Index
Access = Maximize single-AMP operations
- Choose the column(s) most frequently used for JOINs (and WHERE conditions), usually a Primary or Foreign Key
Distribution = Optimize parallel processing
- Choose the column(s) that provide good distribution
- Better when there are more unique values and fewer rows per value
- Adding more columns might result in better distribution but less usability: the RowHash can only be calculated when all values are exactly known (see the example after this list)
Volatility = Reduce maintenance resource overhead (I/O)
- Choose a column with stable data values
- An UPDATE of the PI column effectively results in: DELETE row – send row to new AMP – INSERT row (plus maintaining all Secondary/Join Indexes)
The Primary Key of a table is always a perfect fit based on distribution and volatility, as it's unique and stable. But access is the most important criterion, and a PK might not be accessed at all. So data distribution has to be balanced with access usage in choosing a PI.
</cut & paste>
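To make the access point concrete for your table: with a PI on (customer_id, product, country) the RowHash can only be computed when all three values are supplied, so a filter on customer_id alone can't use single-AMP PI access (the literals below are just placeholders):

-- single-AMP PI access: every PI column is specified
SELECT * FROM rtl.base
WHERE customer_id = 12345
  AND product = 'cereal'
  AND country = 'China';

-- full-table (or partition) scan: the RowHash can't be calculated from customer_id alone
SELECT * FROM rtl.base
WHERE customer_id = 12345;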
Dieter
thanks Dieter,
I've now changed to a non-unique primary index, with a MULTISET table, and partitioning by month on the date range.
Doing the aggregate statistics is pretty fast (~ 3 seconds) on about 800 million rows of data with a few filters applied.
What takes longer is figuring out the first row chronologically of each customer that matches the filters, in the selected date range. It's possible for a customer to purchase a particular item on two separate days, so right now they will be represented as two rows in the table. For a metric like % of customers, I need only one row per customer, and would like to use the first row of each customer (by date).
I've tried the standard GROUP BY subquery:
SELECT
   AVG(t1.age) AS age,
   AVG(t1.spend) AS spend
FROM rtl.base t1
INNER JOIN
 (
   SELECT
      customer_id,
      MIN(dt) AS dt
   FROM rtl.base
   WHERE country = 'China'
     AND product = 'cereal'
     AND dt BETWEEN DATE '2015-01-01' AND DATE '2016-01-01'
   GROUP BY customer_id
 ) t2
ON (t1.customer_id = t2.customer_id AND t1.dt = t2.dt);
Also tried ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY dt ASC) ... QUALIFY row_no = 1
But it takes quite a bit of time (< 30 seconds) on about 3 million rows. Any suggestions on how to speed this up?
Would it help if I added a second level of partitioning on customer_id, so that each AMP would have all the rows of data it needs for the GROUP BY locally?
thanks
Partitioning does not change the distribution of rows, just their internal order on each AMP; distribution is based only on the PI. What's your current PI?
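If you want to check how rows would spread over the AMPs for a given PI, you can hash the candidate PI column(s), e.g. for customer_id:

SELECT HASHAMP(HASHBUCKET(HASHROW(customer_id))) AS amp_no
     , COUNT(*) AS row_cnt
FROM rtl.base
GROUP BY 1
ORDER BY 2 DESC;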
I usually assume that OLAP functions are more efficient. You mean this query was performing slower?
SELECT AVG(age) AS age, AVG(spend) AS spend
FROM
 (
   SELECT age, spend
   FROM rtl.base
   WHERE country = 'China'
     AND product = 'cereal'
     AND dt BETWEEN DATE '2015-01-01' AND DATE '2016-01-01'
   QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY dt ASC) = 1
 ) AS t;
Dieter
That actually seems to have slowed the query down even further; it now runs in 60 seconds.
This is how I partitioned it:
PRIMARY INDEX (customer_id, dt, country, product)
PARTITION BY (
   RANGE_N(dt BETWEEN DATE '2000-04-01' AND DATE '2016-04-05' EACH INTERVAL '1' MONTH),
   RANGE_N(customer_id BETWEEN 1 AND 1000000000 EACH 2000000));
The explain statement says it should only take 3 seconds to run.
Even after collecting stats on the base table for columns customer_id and date, the performance remains slow (44 seconds).
Any suggestions on improvement?
Statistics will hardly help; the estimates might get better, but the actual resource usage will not change.
In theory the fastest way should be
PRIMARY INDEX (customer_id)
PARTITION BY RANGE_N(dt BETWEEN DATE '2000-04-01' AND DATE '2016-12-31' EACH INTERVAL '1' MONTH);
Depending on the actual data this might result in a skewed table. What's the average and maximum number of rows per customer_id/month?
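Something like this should return those numbers (assuming dt is the date column):

SELECT AVG(CAST(cnt AS FLOAT)) AS avg_rows
     , MAX(cnt) AS max_rows
FROM
 (
   SELECT customer_id
        , EXTRACT(YEAR FROM dt)  AS yr
        , EXTRACT(MONTH FROM dt) AS mth
        , COUNT(*) AS cnt
   FROM rtl.base
   GROUP BY 1,2,3
 ) AS t;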
Dieter
Hi Dieter! Is this a one-man support forum?
The average is 1, the max is 55.
thanks,
imran
If there's only a two-digit number of rows per PI value and partition, it should be safe to switch to
PRIMARY INDEX (customer_id)
PARTITION BY RANGE_N(dt BETWEEN DATE '2000-04-01'
                            AND DATE '2016-12-31'   -- go as far into the future as needed to avoid adding partitions later
                            EACH INTERVAL '1' MONTH);
No USI.
If the source data is known to be unique a MULTISET table will speed up the Insert/Select.
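Put together, the DDL would look roughly like this (column names taken from your posts, the data types and table name are just placeholders):

CREATE MULTISET TABLE rtl.base_new      -- placeholder name for the rebuilt table
( customer_id INTEGER NOT NULL
, product     VARCHAR(50)
, country     VARCHAR(50)
, dt          DATE NOT NULL
, age         INTEGER
, spend       DECIMAL(18,2)
)
PRIMARY INDEX (customer_id)
PARTITION BY RANGE_N(dt BETWEEN DATE '2000-04-01'
                            AND DATE '2016-12-31'   -- extend as far into the future as needed
                            EACH INTERVAL '1' MONTH);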
Dieter
thanks Dieter.
Doing it this way will still allow me to filter quickly, say on country, even without a secondary index? How does that work?
It's not unique, as a single customer may have multiple rows.
regards,
imran
Hi all,
I have a small query on some points.
Can anyone assist me?
regards,
Hammam
This is my query:
It currently takes about 10 seconds to run, and I would like to bring it down to 2 seconds. The rtl.base table is partitioned on date (every 7 days) and has a primary index on customer_id, product, country, date (called dt). I have collected statistics on the partition and the age column.
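The statistics statements were roughly:

COLLECT STATISTICS ON rtl.base COLUMN PARTITION;
COLLECT STATISTICS ON rtl.base COLUMN (age);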
This is the explain:
thanks!