Recently I was asked this question in an interview. I gave an answer that runs in O(n) time but takes two passes. The interviewer also asked me how to do the same if the URL list cannot fit into memory. Any help is very much appreciated.
If it all fits in memory, then the problem is simple: Create two sets (choose your favorite data structure), both initially empty. One will contain unique URLs and the other will contain URLs that occur multiple times. Scan the URL list once. For each URL, if it exists in the unique set, remove it from the unique set and put it in the multiple set; otherwise, if it does not exist in the multiple set, add it to the unique set.
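A minimal sketch of the two-set scan described above, assuming the URLs arrive as an iterable of strings. The function name `find_unique_urls` is made up for illustration, and Python's built-in `set` stands in for "your favorite data structure":

```python
def find_unique_urls(urls):
    """Return the set of URLs that occur exactly once, using a single pass."""
    unique = set()    # URLs seen exactly once so far
    multiple = set()  # URLs seen two or more times

    for url in urls:
        if url in unique:
            # Second occurrence: move it from unique to multiple.
            unique.remove(url)
            multiple.add(url)
        elif url not in multiple:
            # First occurrence: tentatively unique.
            unique.add(url)
        # Otherwise the URL is already known to be a duplicate; nothing to do.

    return unique
```

For example, `find_unique_urls(["a", "b", "a", "c", "b", "a"])` leaves only `"c"` in the result. With hash-based sets each membership test, insert, and removal is O(1) on average, so the whole scan is O(n) in one pass.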
If the sets do not fit into memory, the problem is difficult. The requirement of O(n) isn’t hard to meet, but the requirement of a “single pass” (which seems to exclude random access, among other things) is tough; I don’t think it’s possible without some constraints on the data. You can use the set approach with a size limit on the sets, but this would be easily defeated by unfortunate orderings of the data and would in any event only have a certain probability (<100%) of finding a unique element if one exists.
EDIT:
If you can design a set data structure that exists in mass storage (so it can be larger than would fit in memory) and can do find, insert, and delete in O(1) (amortized) time, then you can just use that structure with the first approach to solve the second problem. Perhaps all the interviewer was looking for was to dump the URLs into a database with a UNIQUE index for URLs and a count column.
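A rough sketch of the database idea, using Python's standard `sqlite3` module as a stand-in for whatever database the interviewer had in mind. The table name `urls`, the column name `seen_count`, and the function name `unique_urls_via_db` are all assumptions made for illustration. Note that SQLite's UNIQUE index is a B-tree, so each lookup is O(log n) rather than the O(1) mentioned above:

```python
import sqlite3


def unique_urls_via_db(urls, db_path=":memory:"):
    """Count URLs in a SQLite table and return those seen exactly once.

    Pass a file path as db_path to keep the table on disk instead of in RAM.
    The file is assumed to be fresh; reusing an old one would add to old counts.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "url TEXT NOT NULL UNIQUE, "
            "seen_count INTEGER NOT NULL)"
        )
        for url in urls:
            # Try to bump the counter of an existing row first.
            cur = conn.execute(
                "UPDATE urls SET seen_count = seen_count + 1 WHERE url = ?",
                (url,),
            )
            if cur.rowcount == 0:
                # No row was updated, so this is the first occurrence.
                conn.execute(
                    "INSERT INTO urls (url, seen_count) VALUES (?, 1)",
                    (url,),
                )
        conn.commit()
        rows = conn.execute(
            "SELECT url FROM urls WHERE seen_count = 1 ORDER BY url"
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()
```

The input is still read only once; the final `SELECT` scans the table rather than the original list. Whether that counts as a "single pass" depends on how strictly the interviewer meant it.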