I’m trying to design a somewhat unconventional NER system that marks certain multiword strings

Question

Asked: June 6, 2026 (2026-06-06T12:41:50+00:00)

I’m trying to design a somewhat unconventional NER system that marks certain multiword strings as single units/tokens.

There are a lot of cool NER tools out there, but I have a few special needs that make it pretty much impossible to use something straight out of the box:

First, the entities can’t just be extracted and printed out in a list–they need to be marked in some way and consolidated into tokens.

Second, categorization is not important–Person/Organization/Location doesn’t matter (at least in the output).

Third, these aren’t just your typical ENAMEX named entities we’re looking for. We want companies and organizations, but also concepts like ‘climate change’ and ‘gay marriage.’ I’ve seen tags like these on some tools out there, but all of them were ‘extraction-style’.

How would I go about getting this type of functionality? Would training the Stanford tagger on my own hand-annotated dataset do the job (where ‘climate change’-esque phrases are labeled MISC or something)? Or am I better off just making a shortlist of the ‘weird’ entities and checking the text against that after it’s been run through a regular NER system?

Thanks so much!

1 Answer

Editorial Team · Answer 1 · 2026-06-06T12:41:51+00:00

The underlying CRF model of a named entity tagger such as Stanford NER can be used to recognize any kind of phrase, not just named entities. People have used such models quite successfully to pick out various kinds of terminological phrases. The software can also give you marked-up token sequences in context, rather than just an extracted list.
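To make the "marked-up token sequences" point concrete, here is a minimal sketch of the consolidation step the question asks about. It assumes the tagger's output has already been read into a list of (token, label) pairs in which "O" marks tokens outside any phrase; the function name `merge_tagged_spans`, the `TERM` label, and the underscore joiner are illustrative choices, not part of Stanford NER.

```python
from typing import List, Tuple


def merge_tagged_spans(
    tagged: List[Tuple[str, str]], outside: str = "O", joiner: str = "_"
) -> List[str]:
    """Collapse each run of consecutive tokens sharing a non-outside label
    into one token. The label itself is discarded, since the category does
    not matter in the output."""
    merged: List[str] = []
    span: List[str] = []  # tokens of the phrase currently being built
    prev_label = outside
    for token, label in tagged:
        if label != outside and label == prev_label:
            # Same label as the previous token: extend the current phrase.
            span.append(token)
        else:
            # The label changed: flush any phrase in progress.
            if span:
                merged.append(joiner.join(span))
            span = []
            if label != outside:
                span = [token]  # start a new phrase
            else:
                merged.append(token)  # ordinary token, passed through
        prev_label = label
    if span:
        merged.append(joiner.join(span))
    return merged


if __name__ == "__main__":
    # Hypothetical tagger output, hand-written for illustration.
    example = [
        ("Concern", "O"), ("about", "O"), ("climate", "TERM"),
        ("change", "TERM"), ("is", "O"), ("growing", "O"), ("at", "O"),
        ("Goldman", "ORG"), ("Sachs", "ORG"),
    ]
    print(merge_tagged_spans(example))
```

One limitation of this sketch: two distinct phrases that are directly adjacent and carry the same label would be merged into one token. If that matters, a begin/inside (BIO-style) labeling scheme in the training data would let the merge step tell them apart.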

There is, however, a choice between approaching this in a “more unsupervised” way, using something like NP chunking and collocation statistics, and the fully supervised way of a straightforward CRF, where you provide lots of annotated data of the kind of phrases you’d like to get out.
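As a rough illustration of the “more unsupervised” route, the sketch below scores adjacent word pairs by pointwise mutual information (PMI), one common collocation statistic. It covers only the collocation part, not NP chunking, and the function name `pmi_bigrams`, the `min_count` threshold, and the toy sentence are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import List, Tuple


def pmi_bigrams(
    tokens: List[str], min_count: int = 2
) -> List[Tuple[Tuple[str, str], float]]:
    """Rank adjacent word pairs by PMI, highest first.

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities
    estimated from raw counts. Pairs seen fewer than `min_count` times are
    skipped, because PMI is unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy corpus, hand-written for illustration only.
    text = (
        "the debate on climate change continues and the report on "
        "climate change says the debate on policy continues"
    )
    for pair, score in pmi_bigrams(text.split()):
        print(pair, round(score, 2))
```

On this toy input the pair ("climate", "change") ranks first, because both words occur only together, while pairs involving frequent words like "the" and "on" score lower. On real text you would need a much larger corpus and probably a filter (such as the NP chunking mentioned above) to discard high-scoring pairs that are not phrases you care about.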
