I am trying to figure out how to train the Stanford LexicalizedParser
(edu.stanford.nlp.parser.lexparser.LexicalizedParser) to incorporate new nouns into its lexicon.
At first my goal was to take an existing model and tweak it slightly, rather than creating a brand new model
from a vast set of training examples.
The answer to this question suggests that this is not possible:
How can I add more tagged words to the Stanford POS-Tagger's trained models?
Hopefully someone out there can put me on the right track as to how to do this.
As a concrete example of what I want to do, say I have the word ‘researchgate’, which I want to be treated as a noun when I parse
sentences. Currently, ‘researchgate’ is tagged with different parts of speech depending on its
position, but I want it identified as ‘NN’ (noun).
Examples:
Instead of this:
(NP
  (NP (JJ recent) (NN activity))
  (PP (IN in)
    (NP (PRP$ your) (JJ researchgate) (NNS topics))))
I want this:
(NP
  (NP (JJ recent) (NN activity))
  (PP (IN in)
    (NP (PRP$ your) (NN researchgate) (NNS topics))))
And instead of this:
(ROOT
  (FRAG
    (NP (NN subscription))
    (S
      (VP (TO to)
        (VP (VB researchgate))))))
I want this:
(ROOT
  (NP
    (NP (NN subscription))
    (PP (TO to)
      (NP (NN researchgate)))))
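If the extra training trees come from existing parser output, the tag change in the first example can be applied mechanically. Below is a minimal sketch, assuming the trees are in Penn bracketed form as shown above. `TreeRetagger` and `retag` are hypothetical names for illustration, not part of the Stanford parser API. It only rewrites the tag on a preterminal such as (JJ researchgate), so it would not produce the structural change wanted in the second example (FRAG/S/VP to NP/PP).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Stanford parser.
final class TreeRetagger {
    // Rewrites every preterminal "(TAG word)" whose word equals `word`
    // so that it carries `newTag` instead. Tree structure is left untouched.
    static String retag(String tree, String word, String newTag) {
        // A preterminal is "(" + tag + whitespace + word + ")".
        // The tag itself contains no whitespace or parentheses.
        Pattern preterminal =
                Pattern.compile("\\(([^\\s()]+)\\s+" + Pattern.quote(word) + "\\)");
        Matcher m = preterminal.matcher(tree);
        // quoteReplacement guards against "$" in tags such as PRP$.
        return m.replaceAll(Matcher.quoteReplacement("(" + newTag + " " + word + ")"));
    }

    public static void main(String[] args) {
        String tree = "(NP (NP (JJ recent) (NN activity)) "
                + "(PP (IN in) (NP (PRP$ your) (JJ researchgate) (NNS topics))))";
        System.out.println(retag(tree, "researchgate", "NN"));
    }
}
```

Run on the first tree above, this turns (JJ researchgate) into (NN researchgate) and leaves every other node as it was.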
I am currently using this model: models/edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz
I tried this:
java -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -train /tmp/train.txt
with the contents of /tmp/train.txt as follows:
(NP
(NP (JJ recent) (NN activity))
(PP (IN in)
(NP (PRP$ your) (JJ researchgate) (NNS topics)))))
I got a bunch of promising output, but then got this error:
Error. Can't parse test sentence: [This, is, just, a, test, .]
So clearly I need to supply more examples than just the one I have in /tmp/train.txt.
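Before growing /tmp/train.txt, a quick sanity check on the bracketing may be worthwhile, since a stray parenthesis is easy to introduce when editing trees by hand. Below is a minimal sketch, assuming the file holds one or more Penn-style trees in plain text, where literal parentheses in a sentence are escaped as -LRB- and -RRB-. `BracketCheck` and `imbalance` are hypothetical names, and the check is independent of the parser itself. It is not a fix for the error above, only a guard on the input.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of the Stanford parser.
final class BracketCheck {
    // Returns 0 if the parentheses are balanced, a positive count of
    // unclosed "(" at the end of the input, or -1 as soon as a ")"
    // appears with nothing open.
    static int imbalance(String text) {
        int depth = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (depth == 0) {
                    return -1;
                }
                depth--;
            }
        }
        return depth;
    }

    public static void main(String[] args) throws Exception {
        // With a path argument, check that file; otherwise check a built-in sample.
        String text = args.length > 0
                ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8)
                : "(NP (NP (NN subscription)) (PP (TO to) (NP (NN researchgate))))";
        int result = imbalance(text);
        System.out.println(result == 0 ? "balanced" : "unbalanced (" + result + ")");
    }
}
```

Passing the path of the training file as the only argument checks that file; with no argument the built-in sample tree is checked instead.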
Looking at the documentation, there seems to be one promising method on
LexicalizedParser that I am considering trying:
public static LexicalizedParser getParserFromTreebank(Treebank trainTreebank,
Treebank secondaryTrainTreebank,
double weight,
GrammarCompactor compactor,
Options op,
Treebank tuneTreebank,
List<List<TaggedWord>> extraTaggedWords)
I am hesitant to jump in and try this because it seems tricky to get the Options right.
The documentation says:
options to the parser which MUST be the SAME at both training and testing (parsing) time in
order for the parser to work properly
So I might need guidance on how to extract the options used for
edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz. Perhaps it is
edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams?
Also, maybe I want to add researchgate as one of my extraTaggedWords?
I have the feeling I am on the right track, but was hoping to get some advice before descending
into a rat hole.
Thanks in advance!
chris
I posted to the Stanford parser mailing list and received an answer from John Bauer (thanks, John!):
John Bauer
Unfortunately, you would need to start training from the beginning. There is no way to extend a current parser model.
That feature is on “the list”, but it’s somewhere near the back, so don’t hold your breath…
John