Given a text file in the format below, each line is a list of up to 50
names. Write a program that produces a list of pairs of names which appear
together in at least fifty different lists.

Tyra,Miranda,Naomi,Adriana,Kate,Elle,Heidi
Daniela,Miranda,Irina,Alessandra,Gisele,Adriana

In the above sample, Miranda and Adriana appear together twice, but
every other pair appears only once. It should return
“Miranda,Adriana\n”. An approximate solution may be returned with
lists which appear at least 50 times with high probability.
I was thinking of the following solution:
- Generate a Map<Pair,Integer> pairToCountMap after reading through the file.
- Iterate through the map, and print those with counts >= 50.
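As a rough illustration of those two steps, here is a minimal in-memory sketch in Python, with a `Counter` standing in for the pair-to-count map. The function name `frequent_pairs`, the `threshold` parameter, and the choice to sort names so that (a, b) and (b, a) share one key are all assumptions made for the example, not part of the problem statement.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(lines, threshold=50):
    """Return the unordered name pairs that share at least `threshold` lists."""
    pair_counts = Counter()
    for line in lines:
        # Deduplicate and sort the names so that (a, b) and (b, a)
        # always map to the same key, and a repeated name on one
        # line is not counted twice.
        names = sorted({name.strip() for name in line.split(",") if name.strip()})
        pair_counts.update(combinations(names, 2))
    return [pair for pair, count in pair_counts.items() if count >= threshold]


sample = [
    "Tyra,Miranda,Naomi,Adriana,Kate,Elle,Heidi",
    "Daniela,Miranda,Irina,Alessandra,Gisele,Adriana",
]
# A threshold of 2 is used here only because the sample has two lines.
print(frequent_pairs(sample, threshold=2))  # [('Adriana', 'Miranda')]
```

With up to 50 names per line, each line contributes at most 1,225 pairs, so the map can grow quickly on a large file.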
Is there a better way to do this? The file could be very large, and I’m not sure what is meant by the approximate solution. Any links or resources would be much appreciated.
First let’s assume that names are limited in length, so operations on them are constant time.
Your solution should be acceptable if it fits in memory. If you have N lines with m names each, your solution should take O(N*m*m) to complete.

If that data set doesn’t fit in memory, you can write the pairs to a file, sort that file using a merge sort, then scan through to count pairs. There are O(N*m*m) pairs, so the running time of this is O(N*m*m*log(N*m)), but because a merge sort reads and writes the disk sequentially, it will run much faster in practice than that bound suggests.

If you have a distributed cluster, then you could use MapReduce. It would run very similarly to the last solution.
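The sort-then-scan idea might look roughly like the sketch below. It is only an illustration under some assumptions: names contain no commas or newlines, sorted runs are spilled to temporary files, and the helper names (`iter_pairs`, `write_sorted_runs`, `frequent_pairs_external`) and the `run_size` parameter are made up for the example. Python's `heapq.merge` does the merge step, and `itertools.groupby` counts runs of identical pairs during the final scan.

```python
import heapq
import itertools
import os
import tempfile


def iter_pairs(lines):
    """Yield one 'a,b' line per unordered pair of names found on each input line."""
    for line in lines:
        names = sorted({n.strip() for n in line.split(",") if n.strip()})
        for a, b in itertools.combinations(names, 2):
            yield f"{a},{b}\n"


def write_sorted_runs(pairs, run_size):
    """Sort chunks of at most `run_size` pairs in memory; spill each to a temp file."""
    pairs = iter(pairs)
    paths = []
    while True:
        chunk = sorted(itertools.islice(pairs, run_size))
        if not chunk:
            break
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(chunk)
        paths.append(path)
    return paths


def frequent_pairs_external(lines, threshold=50, run_size=100_000):
    """Merge the sorted runs and count identical adjacent pairs in one scan."""
    run_paths = write_sorted_runs(iter_pairs(lines), run_size)
    # Note: this opens every run at once; a very large input would need
    # a multi-pass merge to stay under the open-file limit.
    files = [open(p) for p in run_paths]
    try:
        merged = heapq.merge(*files)  # lazily merges the sorted runs
        return [
            tuple(key.rstrip("\n").split(","))
            for key, group in itertools.groupby(merged)
            if sum(1 for _ in group) >= threshold
        ]
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)


sample = [
    "Tyra,Miranda,Naomi,Adriana,Kate,Elle,Heidi",
    "Daniela,Miranda,Irina,Alessandra,Gisele,Adriana",
]
# Tiny run_size and threshold, chosen only to exercise the merge on the sample.
print(frequent_pairs_external(sample, threshold=2, run_size=5))  # [('Adriana', 'Miranda')]
```

Only one run's worth of pairs is held in memory at a time, and the counting pass never needs more than the current pair and its running count.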
As for the statistics approach, my guess is that they mean running through the file to find the frequency of each name and how many lines there are with each number of names. If we assume that each line is a random assortment of names, we can use statistics to estimate how often any pair of common names appears together. This will be roughly linear in the length of the file.
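One possible reading of that guess, sketched below, is to collect both statistics in a single pass and then apply a simple independence model. The model is an assumption of this example, not something the problem states: a line with k names is treated as k draws weighted by overall name frequency, so the chance that it contains both a and b is approximated by k*(k-1)*w_a*w_b, where w_a is the share of all name occurrences belonging to a. The function name `estimate_frequent_pairs` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def estimate_frequent_pairs(lines, threshold=50):
    """Estimate co-occurrence counts from name frequencies and line lengths only."""
    name_counts = Counter()    # number of lines each name appears in
    length_counts = Counter()  # number of lines that have k distinct names
    for line in lines:
        names = {n.strip() for n in line.split(",") if n.strip()}
        name_counts.update(names)
        length_counts[len(names)] += 1

    total = sum(name_counts.values())  # total name occurrences in the file
    # Sum of k*(k-1) over all lines: ordered pairs of name slots available.
    slot_pairs = sum(count * k * (k - 1) for k, count in length_counts.items())

    # A name seen in fewer than `threshold` lines cannot be part of a
    # qualifying pair, so only the common names need to be considered.
    common = sorted(n for n, c in name_counts.items() if c >= threshold)

    estimates = {}
    for a, b in combinations(common, 2):
        # Expected number of shared lines under the independence assumption.
        expected = slot_pairs * (name_counts[a] / total) * (name_counts[b] / total)
        if expected >= threshold:
            estimates[(a, b)] = expected
    return estimates
```

The pass over the file is linear, and the pairwise step only touches names that are frequent enough to matter. Because the model assumes names are placed independently, it will miss pairs that appear together more often than chance would predict, so the result is at best a rough candidate list.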