I’m trying to design a somewhat unconventional NER system that marks certain multiword strings as single units/tokens.
There are a lot of cool NER tools out there, but I have a few special needs that make it pretty much impossible to use something straight out of the box:
First, the entities can’t just be extracted and printed out in a list–they need to be marked in some way and consolidated into tokens.
Second, categorization is not important–Person/Organization/Location doesn’t matter (at least in the output).
Third, these aren’t just your typical ENAMEX named entities we’re looking for. We want companies and organizations, but also concepts like ‘climate change’ and ‘gay marriage.’ I’ve seen tags like these on some tools out there, but all of them were ‘extraction-style’.
How would I go about getting this type of functionality? Would training the Stanford tagger on my own hand-annotated dataset do the job (where ‘climate change’-esque phrases are labeled MISC or something)? Or am I better off just making a shortlist of the ‘weird’ entities and checking the text against that after it’s been run through a regular NER system?
Thanks so much!
The underlying CRF model of a named entity tagger such as Stanford NER can actually be used to recognize any kind of token sequence, not just named entities. People have used such models quite successfully to pick out various kinds of terminological phrases. The software can also give you marked-up token sequences in context, rather than just an extracted list.
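Once you have per-token labels, consolidating each labeled run into a single token is a small post-processing step. Here is a minimal sketch, assuming the tagger output has already been read into a list of (token, label) pairs with "O" marking tokens outside any phrase. The function name `merge_entities`, the underscore joiner, and the sample data are my own choices for illustration, not part of Stanford NER.

```python
from itertools import groupby

def merge_entities(tagged, outside="O", joiner="_"):
    """Collapse each run of consecutive same-label tokens into one token.

    `tagged` is a list of (token, label) pairs; tokens labeled `outside`
    are passed through unchanged. (Hypothetical helper, not a library API.)
    """
    merged = []
    for label, group in groupby(tagged, key=lambda pair: pair[1]):
        tokens = [tok for tok, _ in group]
        if label == outside:
            merged.extend(tokens)                # keep ordinary tokens as they are
        else:
            merged.append(joiner.join(tokens))   # one unit per labeled run
    return merged

# Made-up example input, in the shape assumed above.
tagged = [("Talks", "O"), ("on", "O"), ("climate", "MISC"),
          ("change", "MISC"), ("resumed", "O")]
print(merge_entities(tagged))
# ['Talks', 'on', 'climate_change', 'resumed']
```

One caveat: with plain per-token labels, two different phrases that sit directly next to each other and share a label will be merged into one unit. If that matters for your data, a B-/I- style labeling scheme lets you tell where one phrase ends and the next begins.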
There is, however, a choice between two approaches. One is a “more unsupervised” approach, using something like NP chunking and collocation statistics to find candidate phrases. The other is the fully supervised approach of a straightforward CRF, where you provide lots of annotated data for the kind of phrases you’d like to get out.
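To give a feel for the collocation-statistics side, here is a rough sketch that scores adjacent word pairs by pointwise mutual information (PMI), which is one common collocation measure among several. The function name `bigram_pmi` and the `min_count` cutoff are assumptions of mine, and a real system would likely combine this with NP chunking and a much larger corpus.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by PMI: log2( P(w1,w2) / (P(w1) * P(w2)) ).

    Pairs seen fewer than `min_count` times are skipped, since PMI is
    unreliable for rare events. (Illustrative helper, not a library API.)
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Usage: rank candidate pairs, highest PMI first.
tokens = "we discussed climate change and then climate change policy".split()
for pair, score in sorted(bigram_pmi(tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

High-scoring pairs are only candidates. You would still need a threshold or a manual pass to decide which ones count as units, which is the trade-off against annotating data for the supervised CRF.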