I need to implement a process in which a text file of roughly 50 to 150 KB is uploaded and matched against a large number of phrases (~10k).
I need to know which phrases match specifically.
A phrase could be “blah blah blah” or just “blah” – meaning I need to take word-boundaries into account, as I don’t wish to include infix matches.
My first attempt was to just create a large pre-compiled list of regular expressions that look like @"\b{0}\b" (as the 10k phrases are constant, I can cache & re-use this same list against multiple documents).
On my brand-new & very fast PC – this matching is taking 10 seconds+, which I would like to be able to reduce a great deal.
Any advice on how I may be able to achieve this would be greatly appreciated!
Cheers,
Dave
You could use Lucene.NET and the Shingle Filter, as long as you don’t mind having a cap on the number of words a phrase can have.
You can run the analyzer using this utility method.
Once you’ve retrieved all the terms, intersect them with your phrase list.
I think you will find this method is much faster than Regex matching, and it respects word boundaries.
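To show the idea without depending on Lucene.NET, here is a rough sketch of the same shingle-and-intersect approach in plain C#. It is not the utility method mentioned above. The names PhraseMatcher, FindPhrases and maxWords are made up for illustration, and treating \w+ as a "word" is an assumption. It splits the document into words, builds every run of 1 to maxWords consecutive words, and looks each one up in a HashSet of the phrases.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Lucene.NET.
public static class PhraseMatcher
{
    // Split text into lower-cased word tokens.
    // Assumption: a "word" is a run of \w characters.
    private static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\w+"))
            tokens.Add(m.Value.ToLowerInvariant());
        return tokens;
    }

    // Returns the phrases (in normalized, lower-cased form) found in the text.
    // maxWords is the cap on the number of words in a phrase.
    public static HashSet<string> FindPhrases(
        string text, IEnumerable<string> phrases, int maxWords)
    {
        // Normalize the phrases the same way as the document.
        // For a constant phrase list this set can be built once and cached.
        var known = new HashSet<string>();
        foreach (var phrase in phrases)
            known.Add(string.Join(" ", Tokenize(phrase)));

        var tokens = Tokenize(text);
        var found = new HashSet<string>();

        for (int start = 0; start < tokens.Count; start++)
        {
            // Grow the shingle one word at a time: 1 word, 2 words, ...
            string shingle = tokens[start];
            if (known.Contains(shingle)) found.Add(shingle);

            for (int n = 2; n <= maxWords && start + n - 1 < tokens.Count; n++)
            {
                shingle += " " + tokens[start + n - 1];
                if (known.Contains(shingle)) found.Add(shingle);
            }
        }
        return found;
    }
}
```

Because only whole words are compared, a phrase like “row” will not match inside “brown”. Note that this is case-insensitive and ignores punctuation between words, so “blah, blah” would match the phrase “blah blah”, which differs from what a plain \b regex does. Each document costs about (number of words × maxWords) hash lookups, regardless of how many phrases there are.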