I’m trying to perform a dictionary-based NER on some documents. My dictionary, regardless of

Question

Asked: May 17, 2026

I’m trying to perform a dictionary-based NER on some documents. My dictionary, regardless of the datatype, consists of key-value pairs of strings. I want to search for all the keys in the document, and return the corresponding value for that key whenever a match occurs.

The problem is that my dictionary is fairly large: about 7 million key-value pairs, with keys averaging 8 characters and values averaging 20 characters.

I’ve tried LingPipe with MapDictionary, but in my target environment it runs out of memory after about 200,000 entries have been inserted. I’m not sure why LingPipe uses a map rather than a hash map in its algorithm.

I don’t have any previous experience with Lucene, so I’d like to know whether it would make this kind of look-up, at this scale, possible in an easier way.

P.S. I’ve already tried splitting the data into several dictionaries and writing them to disk, but that is relatively slow.

Thanks for any help.

Cheers
Parsa


1 Answer

Editorial Team · Answer 1 · 2026-05-17T15:52:33+00:00

I suppose if you wanted to reuse LingPipe’s ExactDictionaryChunker to do the NER, you could override their MapDictionary to store and retrieve entries from a key/value database of your choice instead of from their ObjectToSet (which does extend HashMap, by the way).

Lucene/Solr can be used as a key/value store, but if you don’t need the extra search capabilities and only want a pure look-up, other options might be a better fit for what you’re doing.
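To make the pure look-up route a bit more concrete, here is a minimal Java sketch. It is not LingPipe’s or Lucene’s API, just an illustration under some assumptions: whitespace tokenization, exact case-sensitive matching, and a greedy longest-match rule. The names `DictionaryTagger`, `Match`, `inMemory` and `maxTokens` are made up for this example. The look-up is passed in as a function so that an in-memory map can later be swapped for a call into whichever key/value store you choose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical dictionary tagger: scans tokens and looks up n-grams by exact key. */
class DictionaryTagger {

    /** One dictionary hit: token span [start, end) plus the matched key and its value. */
    static final class Match {
        final int start;
        final int end;
        final String key;
        final String value;

        Match(int start, int end, String key, String value) {
            this.start = start;
            this.end = end;
            this.key = key;
            this.value = value;
        }
    }

    // Returns the value for a key, or null when the key is absent.
    // Assumption: this could wrap an on-disk key/value store instead of a Map.
    private final Function<String, String> lookup;

    // Longest key length, in tokens, that is worth trying.
    private final int maxTokens;

    DictionaryTagger(Function<String, String> lookup, int maxTokens) {
        this.lookup = lookup;
        this.maxTokens = maxTokens;
    }

    /** Convenience factory for small dictionaries that fit in memory. */
    static DictionaryTagger inMemory(Map<String, String> dictionary, int maxTokens) {
        return new DictionaryTagger(dictionary::get, maxTokens);
    }

    /** Greedy longest-match scan over whitespace-separated tokens. */
    List<Match> tag(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<Match> matches = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            Match best = null;
            int longest = Math.min(maxTokens, tokens.length - i);
            // Try the longest candidate first so "new york" wins over "york".
            for (int len = longest; len >= 1; len--) {
                String candidate =
                        String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                String value = lookup.apply(candidate);
                if (value != null) {
                    best = new Match(i, i + len, candidate, value);
                    break;
                }
            }
            if (best != null) {
                matches.add(best);
                i = best.end; // continue after the matched span
            } else {
                i++; // no key starts at this token
            }
        }
        return matches;
    }
}
```

For example, `DictionaryTagger.inMemory(Map.of("new york", "LOCATION"), 3).tag("I flew to new york yesterday")` returns a single match with `start` 3 and `end` 5. Each token position costs at most `maxTokens` look-ups, so the speed of the backing store is what matters most once the dictionary no longer fits in memory.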
