In my small problem, I have n users and m pieces of equipment (m and n are each around 50,000). A user can use one and only one piece of equipment at a time.
I have a list of records in the format [u, e, t], sorted by t (time) ascending. Each record means user u is using equipment e at time t. The number of records is around 500 million. Assume that two nearest records with the same u and e mean that u is using e continuously between them. For example:
1, 2, 1
3, 4, 1
1, 2, 3
1, 2, 4
1, 2, 5
2, 6, 6
3, 2, 6
3, 2, 8
would mean user 1 uses equipment 2 from 1 to 5.
What I want to do is infer, from this list, the shift times in this format: [u, e, st, et], which means user u uses equipment e from start time st to end time et.
Result for the sample data would be:
1, 2, 0, 5
3, 4, 0, 6
3, 2, 6, 8
(assuming time starts at 0 and ends at max(t), and that when a pair (u, e) is first seen, u has already been using e since time 0. Similarly for the last records.)
Given the big list (500 million records) but small enough m and n, how could I do this most efficiently?
@Edit: Possible data inconsistencies:
1: If there’s only 1 record (so no end time), such as the case of [2, 6, 6] in the sample data:
- If that’s the only time user 2 and equipment 6 appear in the data set, then ignore the data point.
- If after that record, user 2 uses another piece of equipment, let’s say 7 at time 10, then 2 uses 6 from 6 to 10.
- If after that record, equipment 6 is used by another user, let’s say 10 at time 11, then 2 uses 6 from 6 to 11.
Define two structures (I know this is Java, but let’s assume a generic algorithm):
Given that a user cannot be using more than one piece of equipment at the same time, you could create an array/vector of user_records, one for each user (you said this is ~ 50k, so this should be tractable), and an array/vector of machine_records, one for each machine. Initialise all elements’ idx members to -1 (to indicate not currently active).
Then every time you encounter an input record, check the state of the corresponding idx fields in the user_record and machine_record arrays. There are three possibilities:
... start_time in each one.
... idx fields back to -1.
This is O(N) time (where N is the number of input records).
Note: The output will be sorted by end-times.
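To make the idea concrete, here is a minimal Java sketch of one way this could look. It is only one reading of the approach, not the original answer's code: the class ShiftInference, its arrays, and the methods accept/finish are hypothetical names. It assumes that a session with more than one record ends at its last record, that a single-record session ends at the time of the record that interrupts it, and that a single-record session still open at the end of the input is dropped (the three rules from the question's edit). It takes the first observed time as the start, so the question's "time 0" convention is not applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ShiftInference {
    // Hypothetical state; -1 in the two index arrays means "not in a session".
    private final int[] userEquip;   // userEquip[u]: equipment that user u holds
    private final int[] equipUser;   // equipUser[e]: user that holds equipment e
    private final long[] startTime;  // per user: first record of the open session
    private final long[] lastSeen;   // per user: latest record of the open session
    private final int[] count;       // per user: records seen in the open session
    private final List<long[]> shifts = new ArrayList<>(); // rows {u, e, st, et}

    ShiftInference(int maxUserId, int maxEquipId) {
        userEquip = new int[maxUserId + 1];
        equipUser = new int[maxEquipId + 1];
        startTime = new long[maxUserId + 1];
        lastSeen = new long[maxUserId + 1];
        count = new int[maxUserId + 1];
        Arrays.fill(userEquip, -1);
        Arrays.fill(equipUser, -1);
    }

    /** Feed one record; records must arrive in ascending time order. */
    void accept(int u, int e, long t) {
        if (userEquip[u] == e) {            // u and e already linked: session continues
            lastSeen[u] = t;
            count[u]++;
            return;
        }
        if (equipUser[e] != -1) close(equipUser[e], t); // e was held by another user
        if (userEquip[u] != -1) close(u, t);            // u was holding something else
        userEquip[u] = e;                   // both are free now: start a new session
        equipUser[e] = u;
        startTime[u] = t;
        lastSeen[u] = t;
        count[u] = 1;
    }

    /** Emit user u's open session and reset both index fields to -1. */
    private void close(int u, long interruptTime) {
        int e = userEquip[u];
        // Assumption: multi-record sessions end at their last record,
        // single-record sessions end when they are interrupted.
        long end = count[u] > 1 ? lastSeen[u] : interruptTime;
        shifts.add(new long[] {u, e, startTime[u], end});
        userEquip[u] = -1;
        equipUser[e] = -1;
    }

    /** Call once after the last record; flushes sessions that are still open. */
    List<long[]> finish() {
        for (int u = 0; u < userEquip.length; u++) {
            if (userEquip[u] == -1) continue;
            if (count[u] > 1) {
                close(u, lastSeen[u]);
            } else {                        // lone record, never interrupted: ignore it
                equipUser[userEquip[u]] = -1;
                userEquip[u] = -1;
            }
        }
        return shifts;
    }

    public static void main(String[] args) {
        int[][] sample = {{1, 2, 1}, {3, 4, 1}, {1, 2, 3}, {1, 2, 4},
                          {1, 2, 5}, {2, 6, 6}, {3, 2, 6}, {3, 2, 8}};
        ShiftInference s = new ShiftInference(3, 6);
        for (int[] r : sample) s.accept(r[0], r[1], r[2]);
        for (long[] shift : s.finish()) System.out.println(Arrays.toString(shift));
    }
}
```

On the sample data from the question this prints [1, 2, 1, 5], [3, 4, 1, 6] and [3, 2, 6, 8]. Each record is handled in constant time, and the working state is proportional to n + m rather than to the number of records, so the 500 million records can be streamed through accept without being held in memory.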