How to find visited pages for a particular user from a big log file that contains list
of sessionId and PageId combination in each separate line?
File is big enough not to fit in memory. It means find out page that is being visited most in same session(user).
for e.g.
My file is (order is sessionId, PageID)
usera page1
userb page2
userb page1
usera page3
....
It should print
usera visits page1 most followed by page3.
If the number of pages
visited is equal, it is up to you how to handle the case (Can print both, or can print any
one of them)
Which data structure/algorithm will you use for this? Since this is an interview-question, efficient algorithm/data structure would be appreciated. The interviewer did not specify what order of algorithm he was looking for.
I came up with std::map<string,std::pair<string,int> > solution. The interviewer asked if I can do anything better than this or if the key set is so large it won’t be efficiently handled by map, what should be done?
I think the first step would be to remove all “non-usera” lines since you’re doing per-user parsing. This would be a one-time job separating all users into different files. After that you can do a line-by line analysis keeping only a couple of lines in “history”. You can do this using a simple line parser without having to store the whole file in-memory.
If it’s going to be need something like a data structure necessarily, you might want to look into map-reduce paradigm — Hadoop would be ideal for files on the scale of 10GB +.