I’m working on a project that requires finding the most intersected set among a great number of other sets.
That is, I have a large number (~300k) of sets with hundreds of entries each. Given one of the sets, I need to rank the other sets by how much they intersect with it. Additionally, the set entries have properties that can be used as a filter, e.g., for set X, order the other sets by how much they intersect with the subset of X's “green” entries.
I have free rein to architect this solution, and I’m looking for technology recommendations. I was initially thinking a relational DB would be best suited, but I’m not sure how well it would perform doing these comparisons in real time. Somebody recommended Lucene, but I’m not sure how well that would fit the bill.
I suppose it’s worth mentioning that new sets will be added regularly and that the sets may grow, but never shrink.
I’m not sure exactly what you are looking for: a method, a library, or a tool?
If you want to process large datasets quickly using distributed computing, you should check out MapReduce, e.g. using Hadoop on Amazon EC2/S3 services.
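Whichever platform you pick, the core of the job is grouping set ids by entry and then counting co-occurrences, which is also what the map and reduce steps would do. Here is a minimal single-machine sketch of that idea in Python, using an in-memory inverted index. The function names, the `keep` filter parameter, and the sample data (entries as `(name, colour)` tuples) are all made up for illustration, and I haven't tried this at anything near the 300k-set scale.

```python
from collections import Counter, defaultdict


def build_index(sets):
    """Map each entry to the ids of the sets that contain it."""
    index = defaultdict(set)
    for set_id, entries in sets.items():
        for entry in entries:
            index[entry].add(set_id)
    return index


def rank_by_intersection(sets, index, query_id, keep=None):
    """Return (set_id, overlap) pairs, largest overlap first.

    `keep` is an optional predicate applied to the query set's entries,
    e.g. to count only the "green" ones.
    """
    counts = Counter()
    for entry in sets[query_id]:
        if keep is not None and not keep(entry):
            continue
        # Every other set that shares this entry gains one point.
        for other_id in index.get(entry, ()):
            if other_id != query_id:
                counts[other_id] += 1
    return counts.most_common()


# Hypothetical data: each entry is a (name, colour) tuple.
sets = {
    "X": {("a", "green"), ("b", "green"), ("c", "red"), ("d", "red")},
    "Y": {("a", "green"), ("c", "red"), ("d", "red")},
    "Z": {("a", "green"), ("b", "green")},
    "W": {("e", "red")},
}

index = build_index(sets)
ranking = rank_by_intersection(sets, index, "X")
green_ranking = rank_by_intersection(
    sets, index, "X", keep=lambda entry: entry[1] == "green"
)
print(ranking)        # [('Y', 3), ('Z', 2)]
print(green_ranking)  # [('Z', 2), ('Y', 1)]
```

Sets with no shared entries (like `W` here) simply don't appear in the result. Since the sets only ever grow, adding an entry later is just another `index[entry].add(set_id)`, so nothing has to be rebuilt.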