Aster nPath functionality (Volume 1)
This presentation will cover some of the finer points on Aster’s nPath function. Here’s the overview:
- Working with ACCUMULATE
- Understanding MODE (overlapping | nonoverlapping)
- Experimenting with Operator precedence and parentheses
- Number of occurrences options
- Using Multiple WHERE clauses and composite PATTERNs
If you haven’t read Teradata Mike’s Aster nPath Guide first, now is a good time. It provides a great foundation of these concepts and is compulsory before we dig into the more advanced stuff. However, at the risk of redundancy, here’s the nPath Operators we'll be using in our labs.
Operators |
Definition |
. |
Is followed by. Sometimes called the Cascade operator. The expression A.B means ‘A followed by B’ |
| |
Or. Sometimes called Alternative operator |
? |
Occurs at most once (0 to 1 occurrence) |
* |
Occurs zero or more times ( >=0 occurrence) |
+ |
Occurs at least once (>=1 occurrence) |
It worth noting the Operators above are 'greedy' in nPath. This means they are matched from left-to-right and as many rows as possible that can fit in the wildcards will be. You will see examples of this below.
1. Working with ACCUMULATE
We will be using the ACCUMULATE aggregate in our RESULTS clause so it’s important we understand how it works (Note this keyword can be used in a number of Aster functions so it’s not just limited to nPath). ACCUMULATE provides a list of input columns that will be passed as-is to the result set when conditions are True for that Pattern. It will be in the form of a comma delimited string of sequenced values enclosed in brackets. I use it for troubleshooting when I want to understand the nPath logic. Suppose you have the following table:
c1 | c2 -----|----- 1 | a 1 | a 1 | b 1 | b 1 | c 1 | c 2 | a 2 | a 2 | b 2 | b 2 | c 2 | c
You run the following nPath code:
select * from npath (on (select * from overlap)partition by c1 order by c2 mode (nonoverlapping) pattern ('a.b') symbols (c2='a' as A, c2='b' as B) -- using lower case to match data result (count(* of any(A,B)) as Ct_Any_AB, accumulate(c2 of any(A,B))));
The result set is:
ct_any_ab | accumulate
---------------|------------------
2 | [a, b]
2 | [a, b]
Basically nPath starts walking the partitioned ordered rows in top-to-bottom fashion looking for True conditions to return in the answer set. In our example, the nPath mechanics works as follows:
Row 3 – False (no match between row3 and row4). Move on to row 4
Row 4 – True (found match between row 4 and row 5) with RESULTS = 2 [a, b]. Move on to row 6
Row 6 – False (no match between row 6 and row 7). Move on to row 7 ......
I think you get the idea by now. Eventually row 10 and row 11 would result in another True which would return the second 2 [a, b].
So that’s a quick overview of how nPATH’s ACCUMULATE works. I don’t want to muddy the waters, but I would be amiss without mentioning a few things in the above example.
- Unlike Teradata, Aster is case-specific. If I would have replaced the SYMBOLS line with this: symbols(c2='A' as A, c2='B' as B) then the result set would have been empty since there would have been no matches. So be extremely careful with case in your code; and for that matter when INSERTING data
- Notice in the sequence above that once a True was found in Row 2, the next row number to be checked was Row 4, and not Row 3. That was due to the mode(nonoverlapping) clause. We delve into this topic next but just wanted to point it out
OK, we’ve got our feet wet. It’s time to dive into some more detailed examples. Let’s go.
2. Understanding mode (OVERLAPPING | NONOVERLAPPING)
The MODE clause indicates whether the matched PATTERNs may overlap. You get to choose between OVERLAPPING or NONOVERLAPPING as shown below.
select * from npath (on (select * from overlap ) partition by c1 order by c2 mode (NONOVERLAPPING | OVERLAPPING) -- only get to pick one pattern ('A+.B.B.C') symbols(c2='a' as A,c2='b' as B,c2='c' as C) result (first(c1 of ANY(A,B,C)), accumulate(c2 of ANY(A,B,C))));
In NONOVERLAP match mode, nPath begins the next pattern search at the row that follows the last PATTERN match (in our case, next pattern search would start at row 8).
In OVERLAPPING match mode, nPath finds every occurrence of the pattern in the partition, regardless of whether it might have been part of a previously found match. This means that, in OVERLAPPING mode, one row can match multiple symbols in a given matched PATTERN (in this case the next pattern match starts at row 4).
But don’t take my word for it (I’m from Missouri so you have to ‘Show Me’). Let’s do the lab using the same 12-row data set we used earlier. Since my pattern is ('A+.B.B.C'), this means we are looking for a PATTERN of at least 1 A, followed by 2 B’s, followed by a C.
With the code running mode (NONOVERLAPPING) , here’s the output.
first | accumulate
----------|-----------------------
2 | [a, a, b, b, c]
1 | [a, a, b, b, c]
Repeat, only this time using mode (OVERLAPPING) you get 4 rows for the output.
first | accumulate
-----------|-----------------------
2 | [a, b, b, c]
2 | [a , a, b, b, c]
1 | [a, b, b, c]
1 | [a , a, b, b, c]
Of course, OVERLAPPING will not guarantee you have more output rows than NONOVERLAPPING. It all depends on the data. Suppose I were to run the above code with a slightly different pattern as follows:
pattern ('A.B.B.C') then I would get the same answer regardless of mode type since there’s only one match in sequence.
accumulate
----------------------
[a, b, b, c]
3. Experimenting with Operator precedence and Parentheses
It is crucial to know the precedence of Operators from highest to lowest. Here it is:
- Cascade ( . )
- Alternate ( | )
- Frequency ( ?, *, + )
If you wish to change order of execution, parenthesis is the way to do it.
Here’s the table we’ll be running our queries against:
I want to create a query that finds all combinations of A,B or A,C. Looking at the table above, I figured I would get 3 hits (Row 1-2, Row 4-5, Row 6-7). Here's the code:
select * from npath (on (select * from npathBetween) partition by c2 order by c1 mode (nonoverlapping) pattern ('A.B|C') -- notice the lack of parens here symbols(c3='A' as A, c3='B' as B, c3='C' as C) result(count(* of A) as Ct_A, accumulate(c3 of any(A,B)) as Any_AB, accumulate(c3 of any(A,C)) as Any_AC));
Hmmm. I only had 2 hits (A,B) and was missing the (A,C) hit. I could not understand why I did not have the A,C combination in my result set. After going back to the User’s Guide, I figured it out. I had forgotten that ( . ) takes precedence over ( | ) . So my logic was looking for PATTERNS of A.B rows or C rows as follows:
Row 1-2 is A.B Match (A.B)
Row 3 is C Match (C)
Row 4-5 is A.B Match (A.B)
Row 6 is A (with no following B) No Match
Row 7 is C Match (C)
… when I was really wanted PATTERNS of A.B or A.C. To fix the problem, I added paretheses to the code:
select * from npath (on (select * from npathBetween) partition by c2 order by c1 mode (nonoverlapping) pattern ('(A.(B|C))') -- add Parens symbols(c3='A' as A,c3='B' as B,c3='C' as C) result(count(* of A) as Ct_A, -- Ct=2 count(* of C) as Ct_C, -- Ct=2 accumulate(c3 of any(A,B)) as Any_AB, accumulate(c3 of any(A,C)) as Any_AC));
Sure enough I got my 3 hits. 2 hits for A,B pattern and 1 for A,C pattern.
Ahh, that’s more like it. Here you can see the 3 hits of 2 A,B and 1 A,C. The bottom line; it is a best practice to use parentheses in your PATTERN string. It takes away any ambiguity that may exist and makes it more readable for others looking at your code.
4. Number of Occurrence options
If you wish to search for an exact number of occurrences that totals more than one, the nPath coding can become tedious. There are 3 options you can use:
- ' (x){a} ' Exactly A number of occurrences of X
- ' (x){a,} ' At least A number of occurrences of X
- ' (x){a,b} ' A to B occurrences of X
For example, suppose you wish to search for sub-sequence of (A.B|C) at least 4 times. You could code it as: PATTERN ( 'X. (Y.Z) . (A.B|C) . (A.B|C) . (A.B|C) . (A.B|C) * ' ) but that’s pretty messy.
An alternative is the following syntax: PATTERN ( 'X. (Y.Z) . (A.B|C) {4,} ' )
The only thing to be aware of is the need for your occurring PATTERN to be in enclosed in parentheses with the single tics on the outside of the entire PATTERN. Here’s an example of doing it the right way: PATTERN ('X.(A.(B|C)){1,3}')
5. Using multiple WHERE clauses and Composite PATTERNs
It is possible to have multiple WHERE clauses in your nPath code. Their position in the statement determines if they filter INPUT rows or OUTPUT rows.
The below WHERE clause in the second SELECT filters INPUT rows where c1 = 1 or c1 =2. That’s simple enough.
select * from npath (on (select * from overlap WHERE c1 in (1,2)) partition by c1 order by c2 mode (overlapping) pattern ('A?.B?.C?') symbols (c2='a' as A,c2='b' as B,c2='c' as C) result(first(c1 of ANY(A,B,C)), accumulate(c2 of ANY(A,B,C))));
I'm a big believer in using WHERE to filter an many Input rows as possible. It's a much better performer than scanning the entire Table's rows.
If you wanted to later filter Output rows, you can do that with a 2^{nd}WHERE clause as demonstrated.
select * from npath (on (select * from overlap WHERE c1 in (1,2)) partition by c1 order by c2 mode (overlapping) pattern ('A?.B?.C?') symbols (c2='a' as A, c2='b' as B, c2='c' as C) result(first(c1 of ANY(A,B,C)), accumulate(c2 of ANY(A,B,C))as After_nPath)) WHERE After_nPath = ('[a, b]') or After_nPath = ('[b, c]');
Note I could have accomplished the same thing by having composite PATTERN search instead of using the 2nd WHERE clause. Following code produces the same output as above.
select * from npath (on (select * from overlap WHERE c1 in (1,2)) partition by c1 order by c2 mode (overlapping) pattern ('(A.B) | (B.C)') symbols (c2='a' as A, c2='b' as B, c2='c' as C) result(first(c1 of ANY(A,B,C)), accumulate(c2 of ANY(A,B,C))));
Ensure you have space after the comma ( ie: [a, b] ). If you write it as [a,b] , it will return 0 rows.
Be extremely careful when writing composite PATTERNS as I have notice (what I think) is a bug in the code. If you were to write as: PATTERN('A.B' | 'B.C'), you get back a message along the lines 'The connection to the server was lost.' You will not get a result set until you fix the code. I believe this is because you must have only one set of single tics in PATTERN.
In Conclusion
As you can see, there’s lot of functionality baked into the nPath. I consider it one of Aster’s premier functions since it has so many uses and can provide deep insights into customer’s behavior such as buying trends and patterns.
Keep your eyes peeled for more nPath articles. In particular, we’ll be talking about the LAG function next which can be embedded in SYMBOLS clause. It can answer such questions as: “Find customers who view a product and then buy it with 1 minute on my web page.” That would be a valuable piece of information if I were a retailer.
And if you are interested in running these labs now, here's the CREATE TABLE statment and INSERTs to help you along:
create table overlap (c1 int, c2 text) distribute by hash(c1);
insert into overlap values (1, 'a');
insert into overlap values (1, 'a');
insert into overlap values (1, 'b');
insert into overlap values (1, 'b');
insert into overlap values (1, 'c');
insert into overlap values (1, 'c');
insert into overlap values (2, 'a');
insert into overlap values (2, 'a');
insert into overlap values (2, 'b');
insert into overlap values (2, 'b');
insert into overlap values (2, 'c');
insert into overlap values (2, 'c');
Hi Mark,
My customer has just setup a Terdata Aster POC environement. I have fairly good knowledge about Teradata working as a software engineer in Teradata Labs in the past and now as a full time DBA but I am pretty new to Aster Technology. Could you please suggest some reference document as a starting point to understand Aster Architecture. Is there anything similar to Teradata reference documentation in Aster?
Thanks,
Ravi
Ravi Singh