All Forums General
Seitarin 3 posts Joined 05/16
03 May 2016
Looking for help with regards to nested aggregates

Hi all,
I am new to Teradata and I am having trouble building a query, would appreciate all the help I can get thanks! I'm looking to identify the highest difference in value between people entering and exiting a rail station, Location A (over the course of a day) and finding out the average of this value (daily) over all weekdays in a certain month (e.g. October).
I have this example table below where it shows the timestamp, where their journey originates from (as they tap into the station) and where their destination is (as they tap out). They all have unique IDs based on their card used to tap in.

Date	Time_IN	ORIG_LOC	DEST_LOC	CRD_NUM
01/10/2015	8:40	Location A	Location B	1234567
01/10/2015	8:40	Location A	Location B	390480
01/10/2015	8:41	Location A	Location C	9133244
01/10/2015	8:42	Location A	Location D	1239293
01/10/2015	8:43	Location D	Location A	31323552
01/10/2015	8:44	Location F	Location A	12145353
01/10/2015	8:45	Location A	Location B	35778655
01/10/2015	8:46	Location C	Location A	3930305

 
So at 8:45, there have been 5 people who entered the station and 2 people who exited the station (difference in value of 3). At 8:46, 5 people entered and 3 people exited (difference in value of 2)., so on and so forth. What I want to find out is over the course of the entire day 01/10/2015, what is the highest difference in value between entries and exits, and obtaining this value for all weekdays in October and finding out the average. 
 
All help appreciated, thanks!! :)

dnoeth 4628 posts Joined 11/04
04 May 2016

London Oyster Card?  :-)
You need another nesting level for each nested aggregation: avg(max(count)))

select acg(maxdiff)
from
 (
   select date, max(diff) as maxdiff
   from
    (
      select date, time_in, 
          sum(case when DEST_LOC = 'Location A' then 1 else 0 end) -- exit
        - sum(case when ORIC_LOC = 'Location A' then 1 else 0 end) -- enter
        AS diff
      from tab
      where (ORIC_LOC = 'Location A' or DEST_LOC = 'Location A')
        and date between DATE '2015-10-01' and DATE '2015-10-30' -- October 2015
      group by date, time_in
    ) as dt
 ) as dt

 

Dieter

Seitarin 3 posts Joined 05/16
04 May 2016

Hi dnoeth, thanks for the help! I realised that my column header 'Date' is actually a reserved word, hence the coding might run into a few errors. Also, I realised that I forgot to include another column into my header which shows the exit timings for people exiting from Location A, which is needed instead of using the entry time (at their different locations).
 
Here's the updated table below:

B_DT	Time_Out	ORIG_LOC	Time_Out	DEST_LOC	CRD_NUM
01/10/2015	8:40	Location A	8:59	Location B	1234567
01/10/2015	8:40	Location A	9:02	Location B	390480
01/10/2015	8:41	Location A	9:05	Location C	9133244
01/10/2015	8:42	Location A	9:20	Location D	1239293
01/10/2015	8:43	Location D	9:25	Location A	31323552
01/10/2015	8:44	Location F	9:11	Location A	12145353
01/10/2015	8:45	Location A	9:01	Location B	35778655
01/10/2015	8:46	Location C	9:02	Location A	3930305

Also, just out of curiosity, when 'case' is being used, it sort of acts like a scenario and can be used as a substitute for the usual sum(count())? This is because i realised that in your code, you didn't use sum(count(crd_num)) at all, which I thought would have been required to calculate the maxdiff. 
Appreciate your help, thanks!! :)

dnoeth 4628 posts Joined 11/04
04 May 2016

So this is the base data, no pre-calculation, yet?
You need another Derived Table to combine the times for enter and exit:

SELECT AVG(maxdiff)
FROM
 (
   SELECT B_DT, MAX(diff) AS maxdiff
   FROM
    (
      SELECT B_DT, time_in, SUM(out_) - SUM(in_) AS diff
      FROM
       ( -- combine both enter and exit
         SELECT B_DT, time_in, 1 AS in_, 0 AS out_
         FROM tab
         WHERE ORIC_LOC = 'Location A' 
           AND B_DT BETWEEN DATE '2015-10-01' AND DATE '2015-10-30' 
         UNION ALL
         SELECT B_DT, time_out, 0 AS in_, 1 AS out_
         FROM tab
         WHERE DEST_LOC = 'Location A' 
           AND B_DT BETWEEN DATE '2015-10-01' AND DATE '2015-10-30'
       ) AS dt
    ) AS dt
   GROUP BY B_DT
 ) AS dt

 
An aggregate over a CASE is called conditional aggregation, it's like different WHERE-conditions within the same query.
And the SUM(CASE) emulates a COUNT, summing ones is like counting.
 

Dieter

Seitarin 3 posts Joined 05/16
04 May 2016

Hmm just curious, why the need to insert the UNION ALL command? Having this command would stack the out_ and in_ data into a single table? Would it be easier if both datasets were separated into 2 different tables and then tabulated accordingly based on b_dt? 
I am also getting an error message of "selected non-aggregated values should be included in the associated group" when i try to run this script. Am I supposed to include time_in and time_out in the GROUP BY command?
Thanks!

jainayush007 50 posts Joined 03/11
09 Aug 2016

Hello, I need help as well on a case for nested aggregation as well.In this sample data, I need to find avg of field D over C and then sum them up over A in one query. Can it be done without sub query or derived table? 
A B C D
C1 R1 I1 1
C1 R1 I2 10
C1 R1 I3 2
C1 R1 I1 2
C1 R2 I3 3
C2 R3 I4 2
C2 R3 I3 4
C2 R3 I4 4
C2 R4 I6 5
C2 R4 I2 15
C3 R5 I1 3
C3 R5 I5 4
C3 R6 I6 7
C3 R6 I1 1.5
C3 R6 I6 10
C3 R6 I1 2.5
 
Result
 
Group by A COUNT DISTINCT B COUNT DISTINCT C SUM(AVG(D) PARTITION BY(A,C))
C1 2 3 14
C2 2 4 27
C3 2 3 14.83333333

jainayush007 50 posts Joined 03/11
09 Aug 2016

Alternatively, if we have to do a sub query, is that a way we can achieve a result set as below - 
Find Avg within (A,C) and populate only for the first occurance of an I<> in a C<> and then sum it up in the outer query?
A B C D AVG(D) PARTITION BY(A,C)

C1 R1 I1 1 1.5

C1 R1 I2 10 10

C1 R1 I3 2 2.5

C1 R1 I1 2

C1 R2 I3 3

C2 R3 I4 2 3

C2 R3 I3 4 4

C2 R3 I4 4

C2 R4 I6 5 5

C2 R4 I2 15 15

C3 R5 I1 3 2.333333333

C3 R5 I5 4 4

C3 R6 I6 7 8.5

C3 R6 I1 1.5

C3 R6 I6 10

C3 R6 I1 2.5

 

 

Group by A COUNT DISTINCT B COUNT DISTINCT C SUM

C1 2 3 14

C2 2 4 27

C3 2 3 14.83333333

 

jainayush007 50 posts Joined 03/11
11 Aug 2016

Hi Dieter... Any help on the above would be helpful....

yuvaevergreen 93 posts Joined 07/09
11 Aug 2016
Is this what you are expecting?


SEL A,COUNT(DISTINCT B),COUNT(DISTINCT C),SUM (SM1) FROM 
(SEL A,B,C,
CAST(AVG(D) OVER (PARTITION BY A,C ORDER BY A ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) 
AS DECIMAL(2,1)) AS SM1
FROM TABLE
) A
GROUP BY 1

 

jainayush007 50 posts Joined 03/11
11 Aug 2016

Hello.. Cant have a sub query.. actually this needs to be executed on a reporting solution and it cant allow sub queries. Also,  when you we do AVG(D) OVER (PARTITION BY A,C ORDER BY A ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
it would cause the avg to be across all the records and when summed up it would give wrong results.

jainayush007 50 posts Joined 03/11
12 Aug 2016

Hello.. Cant have a sub query.. actually this needs to be executed on a reporting solution and it cant allow sub queries. Also,  when you we do AVG(D) OVER (PARTITION BY A,C ORDER BY A ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
it would cause the avg to be across all the records and when summed up it would give wrong results.
Referring my second post above should help.

dnoeth 4628 posts Joined 11/04
12 Aug 2016

Sorry, but I don't get what you're trying to do with your sample data and expected result.
 
If your "reporting solution" doesn't support subqueries/Derived Table you better dump it.

Dieter

You must sign in to leave a comment.