All Forums Tools
AndrewT 1 post Joined 04/08
17 Apr 2008
SQL Question

I am working with a VERY large table (appx 15 Billion rows) and I want to pull back only those rows when conditions based on the previous row are met.For example, if the table contained the following rows:Customer Event_Time Event_Type1 1 A 1 6 B 1 7 A 1 9 C 2 2 A1 4 B 1 8 D 2 3 B I would only want to return these two rows:Customer Event_Time Event_Type1 4 B 1 6 B -- in essence, I want to return rows that after being sorted by Customer and Event_Time have the same value in Event_Type. Also, there can be more than two rows consecutive rows with the same Event_Type; in those cases, I need ALL the rows (e.g. if there were another row with Customer=1 and Event_Time=5 and Event_Type=B I would want that row as well).My current approach is to create a table which contains a portion of the data (roughly 1B rows) in sort order and then use SAS to determine which consecutive rows to keep. I know it would be much faster if I could do the row selection in TD rather than bringing it down and processing in SAS.Thanks in advance for any advice.

SN 77 posts Joined 01/07
25 Apr 2008

hi Andrew,here's a way to do it in TD:1. sort the records based on cust & event_tm and assign a incremental sequence number( say SEQ_NUM) for each row** since you may have 15 billion recs, define the SEQ_NUM column as data type 'LONG VARCHAR'create table tab1(cust ***,ev_tm **** ,ev_ty ****,SEQ_NUM LONG VARCHAR) unique primary index(seq_num);2. Insert into this table with the records sorted and sequence number assigned for each row:Ins tab1sel cust,event_tm,event_type,csum(1,cust,ev_tm) as R_NUMfrom 3. the select query should give you the required answerset:sel a.cust,a.event_tm,a.event_type from tab1 a,(sel * from tab1)bwhere a.ev_ty=b.ev_ty and (a.flag=b.flag-1 or a.flag=b.flag+1) ;hope this helps!

SN 77 posts Joined 01/07
25 Apr 2008

a typo correction in myearlier reply:sel a.cust,a.event_tm,a.event_type from tab1 a,(sel * from tab1)bwhere a.ev_ty=b.ev_ty and (a.flag=b.flag-1 or a.flag=b.flag+1) ;*** replace the field name 'flag' with seq_num

tanova535 7 posts Joined 01/06
28 Apr 2008

Just a thought but what about the use of RANK() ? RANKRANK is the most straightforward of the OLAP extensions. It not only sorts a result set but also identifies the numeric rank of each row in the result. RANK has only the sort column(s) as its argument and returns an integer that represents the rank of each row in the result. For example, here’s how the RANK function lets you sort employees alphabetically and identify their level of seniority in the company:SELECT EmployeeName, (HireDate - DATE) AS ServiceDays, RANK( ServiceDays ) as SeniorityFROM EmployeeORDER BY EmployeeName;EmployeeName Service Days SeniorityRobyn Baker 9931 2Nick Garrison 9931 2Kyle McVicker 9408 5Eva O’Malley 10248 1Norma Powers 9409 4And the following example operates on the result of a derived table and join to sort items by category and descending overall rank of revenue:SELECT Category, Item, Revenue, RANK( Revenue ) AS ItemRankFROM ItemCategory, (SELECT Item, Sum( Sales ) as Revenue FROM DailySales GROUP BY Item) AS ItemSalesWHERE ItemCategory.Item = ItemSales.ItemORDER BY Category, ItemRank DESC;Category Item Revenue ItemRankHot Cereal Regular Oatmeal 39112.00 4Hot Cereal Instant Oatmeal 44918.00 3Hot Cereal Regular COW 59813.00 2Hot Cereal Instant COW 75411.00 1

You must sign in to leave a comment.