I’m building an application that uses a dependency tree parser, specifically the Stanford Parser. Occasionally it changes one or two letters of some words in the sentence I want to parse. This is a big problem for me, because I can’t see any pattern in these changes and I need the dependency tree to contain exactly the words of my sentence.
All I can see is that only some words are affected. I’m working with a database of tweets, so the data contains a lot of spelling and grammar mistakes. For example, the hashtag ‘#AllAmericanhumour’ becomes AllAmericanhumor: it loses one letter (u).
Is there anything I can do to solve this problem? My first thought was to use an edit distance algorithm, but I think there might be an easier way to do it.
Thanks everybody in advance
You can give options to the tokenizer with the -tokenize.options flag/property. For this particular normalization, you can turn it off with
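Assuming the normalization at work here is PTBTokenizer's americanize option (the one that rewrites British spellings such as humour to humor), a minimal sketch of passing it through the tokenize.options property might look like this. The class and method names are only illustrative, and the option name is an assumption to check against the tokenizer documentation:

```java
import java.util.Properties;

public class TokenizerOptions {
    // Builds a Properties object carrying the tokenizer option.
    // "americanize=false" is assumed to be the option that stops
    // British-to-American spelling rewrites; verify it in the PTBTokenizer docs.
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("tokenize.options", "americanize=false");
        return props;
    }

    public static void main(String[] args) {
        // Print the value that would be handed to the parser or pipeline.
        System.out.println(build().getProperty("tokenize.options"));
    }
}
```

On the command line the same setting would presumably be written as -tokenize.options americanize=false.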
There are also various other normalizations that you can turn off (see PTBTokenizer or http://nlp.stanford.edu/software/tokenizer.shtml). You can turn off a lot of them at once with ptb3Escaping=false.
However, the parser is trained on data that looks like the output of ptb3Escaping=true and so will tend to degrade in performance if used with unnormalized tokens. So, you may want to consider alternative strategies.
If you’re working at the Java level, you can look at the word tokens, which are actually Maps, and they have various keys. OriginalTextAnnotation will give you the unnormalized token, even when it has been normalized. CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation will map to character offsets into the text.
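The offset-based strategy can be sketched without touching the parser at all. The Token class below is a hypothetical stand-in for a parser token, with begin and end playing the role of the values you would read from CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation. The idea is simply to slice the original text with those offsets, so the parser still sees normalized tokens while you keep the original spelling:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OriginalTokens {
    // Hypothetical stand-in for a parser token: the (possibly normalized)
    // word plus its character offsets into the original text.
    static final class Token {
        final String word;
        final int begin; // inclusive start offset
        final int end;   // exclusive end offset

        Token(String word, int begin, int end) {
            this.word = word;
            this.begin = begin;
            this.end = end;
        }
    }

    // Recovers the original surface form of each token by slicing the
    // source text with the token's offsets, ignoring the normalized word.
    static List<String> originalWords(String text, List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(text.substring(t.begin, t.end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "I love #AllAmericanhumour";
        List<Token> tokens = Arrays.asList(
            new Token("I", 0, 1),
            new Token("love", 2, 6),
            new Token("#AllAmericanhumor", 7, 25)); // normalized spelling
        System.out.println(originalWords(text, tokens));
    }
}
```

Because the tree nodes and the tokens line up one to one, you could relabel the leaves of the dependency tree with these recovered strings after parsing.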
p.s. And you should accept some answers :-).