I need to implement a process, wherein a text file of roughly 50/150kb is

Question

Asked: June 11, 2026

I need to implement a process wherein a text file of roughly 50 to 150 KB is uploaded and matched against a large number of phrases (~10k).

I need to know which phrases match specifically.

A phrase could be “blah blah blah” or just “blah” – meaning I need to take word-boundaries into account, as I don’t wish to include infix matches.

My first attempt was to just create a large pre-compiled list of regular expressions that look like @"\b{0}\b" (as the 10k phrases are constant, I can cache and re-use this same list against multiple documents).

On my brand-new and very fast PC, this matching takes more than 10 seconds, which I would like to reduce a great deal.

Any advice on how I may be able to achieve this would be greatly appreciated!

Cheers,
Dave

1 Answer

Editorial Team · Answer 1 · 2026-06-11T16:50:27+00:00

You could use Lucene.NET and its ShingleFilter, as long as you don’t mind having a cap on the number of words a phrase can have.

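public class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Tokenize, lowercase, then emit shingles (word n-grams) of up to 6 words.
        // The 6 is the cap on phrase length; raise it if your phrases are longer.
        return new ShingleFilter(new LowerCaseFilter(new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader)), 6);
    }
}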

You can run the analyzer using this utility method.

public static IEnumerable<string> GetTerms(Analyzer analyzer, string keywords)
{
    var tokenStream = analyzer.TokenStream("content", new StringReader(keywords));
    var termAttribute = tokenStream.AddAttribute<ITermAttribute>();

    // A HashSet ignores duplicates, so no Contains check is needed before Add.
    var terms = new HashSet<string>();

    while (tokenStream.IncrementToken())
    {
        terms.Add(termAttribute.Term);
    }

    return terms;
}

Once you’ve retrieved all the terms, intersect them with your phrase list.

// Every shingle found in the document (not yet filtered against the phrase list).
var documentShingles = GetTerms(new MyAnalyzer(), "Here's my stuff I want to match");

var matchingPhrases = phrasesToMatch.Intersect(documentShingles, StringComparer.OrdinalIgnoreCase);

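I think you will find this method is much faster than Regex matching, since the document is tokenized once and each phrase becomes a set lookup instead of a separate scan, and it respects word boundaries.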
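
For anyone who wants to see the idea without the Lucene.NET dependency, here is a rough, dependency-free sketch of the same shingle-and-intersect approach. The class PhraseMatcher and its methods Tokenize, GetShingles and FindMatches are hypothetical names made up for this illustration, and the simple \w+ tokenizer is an assumption that will not split text exactly the way StandardTokenizer does (for example around apostrophes). Phrases are normalized with the same tokenizer, so the comparison stays consistent.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of Lucene.NET.
public static class PhraseMatcher
{
    // Assumption: a "word" is any run of \w characters.
    private static readonly Regex WordPattern = new Regex(@"\w+", RegexOptions.Compiled);

    // Splits text into lowercase word tokens.
    public static List<string> Tokenize(string text)
    {
        return WordPattern.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value.ToLowerInvariant())
            .ToList();
    }

    // Builds every run of 1..maxWords consecutive words, joined by single spaces.
    public static HashSet<string> GetShingles(IList<string> words, int maxWords)
    {
        var shingles = new HashSet<string>();
        for (int start = 0; start < words.Count; start++)
        {
            var sb = new StringBuilder();
            for (int len = 0; len < maxWords && start + len < words.Count; len++)
            {
                if (len > 0) sb.Append(' ');
                sb.Append(words[start + len]);
                shingles.Add(sb.ToString());
            }
        }
        return shingles;
    }

    // Returns the phrases whose normalized form appears as a shingle in the document.
    // Phrases longer than maxWords can never match, which is the cap mentioned above.
    public static List<string> FindMatches(string document, IEnumerable<string> phrases, int maxWords = 6)
    {
        var shingles = GetShingles(Tokenize(document), maxWords);
        return phrases
            .Where(p => shingles.Contains(string.Join(" ", Tokenize(p))))
            .ToList();
    }
}
```

Usage would look something like this:

```csharp
var phrases = new[] { "blah blah blah", "blah", "stuff I want" };
var hits = PhraseMatcher.FindMatches("Here's my stuff I want to match", phrases);
// hits contains only "stuff I want"
```

Because matching is done on whole tokens, a phrase such as "blah" would not match inside "blahblah". In a real setup you would normalize the 10k phrases once and cache them, rather than re-tokenizing them for every document as this sketch does.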
