I want to find the most frequently occurring string in a huge log file. Can someone help me with how to do this? One way is to hash every string, keep a count for each, and take the maximum, but that's not efficient. Are there any better ways to do this?
Thanks & Regards,
Mousey.
If performance is critical, you may want to look at a trie or a radix tree.
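As a rough illustration of the trie idea, here is a minimal Python sketch. The names `TrieNode` and `most_frequent_line` are made up for this example, and it assumes each log line is one "string" to be counted. Each line is walked character by character, the node where it ends holds its count, and the best line seen so far is tracked along the way. Whether this actually beats a plain hash table depends on the data and the language, so treat it as a starting point rather than a tuned solution.

```python
class TrieNode:
    """One node per character; count is the number of lines ending here."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


def most_frequent_line(lines):
    """Return (line, count) for the most frequent line, or (None, 0) if empty.

    On a tie, the line that reached the top count first is returned.
    """
    root = TrieNode()
    best_line, best_count = None, 0
    for line in lines:
        node = root
        for ch in line:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = node.children[ch] = TrieNode()
            node = nxt
        node.count += 1
        if node.count > best_count:
            best_line, best_count = line, node.count
    return best_line, best_count
```

For a real log you could pass a generator such as `(l.rstrip("\n") for l in open_file)` so the file is streamed rather than loaded into memory. Lines that share long prefixes (timestamps, hostnames) share nodes, which is where a trie or radix tree can save space.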
If you’re just interested in knowing whether one of the strings appears more than 50% of the time (let’s call that string the majority string), you can do something like this (see if I can get this right):
1. Get the first string, assume it’s the majority string, and set its occurrence count to 1.
2. Get the next string.
3. If it’s the same as the current majority candidate, increment the occurrence count.
4. Otherwise, decrement the occurrence count.
5. If the occurrence count reaches 0, replace the majority candidate with the current string and set its occurrence count back to 1.
6. Repeat from 2 as long as you have strings to read.
7. At the end, whatever is left is only a candidate: rescan the log and count the actual number of occurrences of the candidate to check whether it really is the majority string.
So you’ll have to go through the log twice.
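The steps above can be sketched in Python roughly as follows. This style of counting is usually known as the Boyer-Moore majority vote algorithm. The names `find_majority` and `read_strings` are made up for this example, and `read_strings` is assumed to be a function that returns a fresh iterator over the log each time it is called, since two passes are needed. The sketch also assumes that when the count drops to 0 the new candidate starts again with a count of 1.

```python
def find_majority(read_strings):
    """Return the string that makes up more than half the input, or None.

    read_strings: zero-argument callable returning a fresh iterator
    (hypothetical interface, needed because the data is read twice).
    """
    it = iter(read_strings())
    try:
        candidate = next(it)  # first string is the initial candidate
    except StopIteration:
        return None  # empty input has no majority
    count = 1

    # First pass: pick a candidate.
    for s in it:
        if s == candidate:
            count += 1
        else:
            count -= 1
            if count == 0:
                # Assumption: the replacement starts with a count of 1.
                candidate, count = s, 1

    # Second pass: verify the candidate really is the majority.
    total = occurrences = 0
    for s in read_strings():
        total += 1
        if s == candidate:
            occurrences += 1
    return candidate if 2 * occurrences > total else None
```

For a file, `read_strings` could be something like `lambda: (l.rstrip("\n") for l in open(path))`, where `path` is a placeholder. Memory use is constant apart from the strings being compared, which is the main attraction over counting every distinct string.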
Note: This is from a problem used in an ACM programming contest a while ago, available here.