I have records of type:
time | url
=================
34   | google.com
42   | cnn.com
54   | yahoo.com
64   | fb.com
I want to add another column to these records, time_diff, which holds the difference between the time of the current record and the time of the previous record. The output should look like:
time | url        | time_diff
=============================
34   | google.com | -- <can drop this row>
42   | cnn.com    | 08
54   | yahoo.com  | 12
64   | fb.com     | 10
If I could somehow add another column (a copy of time) shifted by one row, so that 42 is aligned with 34, 54 is aligned with 42, and so on, then I could take the difference between the two columns to calculate the time_diff column.
I can project the time column to a new variable T, and if I can drop the first record in the original data, then I can join it with T to obtain the desired result.
I appreciate any help. Thanks!
See this question, for example. You'll need to get your tuples in a bag (using GROUP ... ALL in your case), and then in a nested FOREACH, ORDER them and call a UDF to rank them. After you have this rank, you can FLATTEN the bag back out into a set of tuples again, and you'll have three fields: time, url, and rank. Once you have this, create a fourth column which is rank-1, do a self-join on those latter two columns, and you'll have what you need to compute the time_diff.
Since multiple records can have the same time, it would be a good idea to also sort on url so that you are guaranteed the same result every time.
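To make the rank-and-self-join logic concrete, here is a minimal sketch in plain Python rather than Pig. It only illustrates the steps described above (sort on time and url, assign a rank, join each record to the one whose rank equals rank-1); the function name add_time_diff and the tuple layout are assumptions made for this example, not part of any Pig API.

```python
# Sample records from the question: (time, url)
records = [(34, "google.com"), (42, "cnn.com"), (54, "yahoo.com"), (64, "fb.com")]


def add_time_diff(records):
    """Return (time, url, time_diff) rows using a rank / rank-1 self-join.

    Hypothetical helper for illustration; it mirrors the ORDER, rank,
    and self-join steps, not an actual Pig script.
    """
    # Sort on time, then url, so records with equal time always get the same rank.
    ordered = sorted(records, key=lambda r: (r[0], r[1]))

    # Attach a 1-based rank to every record: (time, url, rank).
    ranked = [(time, url, rank) for rank, (time, url) in enumerate(ordered, start=1)]

    # Index the time of each record by its rank; this is the right side of the self-join.
    time_by_rank = {rank: time for time, _url, rank in ranked}

    result = []
    for time, url, rank in ranked:
        prev_rank = rank - 1  # the "fourth column"
        # Inner-join semantics: the first record has no match and is dropped.
        if prev_rank in time_by_rank:
            result.append((time, url, time - time_by_rank[prev_rank]))
    return result


if __name__ == "__main__":
    for row in add_time_diff(records):
        print(*row)
```

Because the join is an inner join, the first record (which has no predecessor) falls out of the result on its own, which matches the "can drop this row" note in the question.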