I am building a 15k-line training data file called en-ner-person.train, following the online manual (http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html).
My question is: in my training document, do I include an entire report? Or do I only include the lines which have a name: <START:person> John Smith <END>?
So for example do I use this entire report in my training data:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
A nonexecutive director has many similar responsibilities as an executive director.
However, there are no voting rights with this position.
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
Or do I only include these two lines in my training document:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
You should use the entire report. Sentences without names show the model when not to mark an entity, which reduces false positives and so improves precision.
You can measure the difference with the evaluation tool. Reserve part of your corpus for testing, for example 1/10 of the sentences, and train your model on the other 9/10. Train one model on the entire report and another on only the sentences with names, then evaluate both on the same test sample. The results are reported as precision and recall.
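The 9/10 and 1/10 split can be sketched in a few lines of Python. This is only an illustration of the idea, not part of OpenNLP: the function name split_corpus, the test_every parameter, and the placeholder sentences are all assumptions made for the example. It holds out every tenth sentence, whether or not it contains a name.

```python
def split_corpus(sentences, test_every=10):
    """Hold out every `test_every`-th sentence for testing; train on the rest.

    Hypothetical helper: sentences are kept whether or not they contain a
    <START:person> tag, so the test set reflects the real reports.
    """
    train, test = [], []
    for index, sentence in enumerate(sentences):
        if index % test_every == test_every - 1:
            test.append(sentence)
        else:
            train.append(sentence)
    return train, test


# Placeholder corpus of 20 sentences (made up for the example).
corpus = [f"sentence {n}" for n in range(1, 21)]
train, test = split_corpus(corpus)
print(len(train), len(test))  # 18 2
```

Each list can then be written to its own file, one sentence per line, and passed to the training and evaluation tools.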
Remember to build the test sample from the entire report, not only from the sentences with names; otherwise you will not get an accurate measure of how the model performs on sentences without names.
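To see why the test sample matters, here is a rough Python sketch of how precision and recall could be computed from tagged lines. It is not the OpenNLP evaluator: it compares the tagged name strings per sentence rather than token spans, and the "predicted" output, including the wrongly tagged word "director", is invented for the example.

```python
import re
from collections import Counter

# Matches the text between <START:person> and <END> in the training format.
TAG = re.compile(r"<START:person>\s+(.*?)\s+<END>")


def names(line):
    """Return the person names tagged in one annotated line."""
    return TAG.findall(line)


def precision_recall(gold_lines, predicted_lines):
    """Compare tagged names sentence by sentence (a simplification)."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_lines, predicted_lines):
        g, p = Counter(names(gold)), Counter(names(pred))
        matched = sum((g & p).values())
        tp += matched
        fp += sum(p.values()) - matched
        fn += sum(g.values()) - matched
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gold = [
    "<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .",
    "A nonexecutive director has many similar responsibilities as an executive director.",
    "However, there are no voting rights with this position.",
    "Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .",
]
# Hypothetical model output: both names found, plus one false positive
# in a sentence that contains no name.
predicted = list(gold)
predicted[1] = "A nonexecutive <START:person> director <END> has many similar responsibilities as an executive director."

full = precision_recall(gold, predicted)
names_only = precision_recall([gold[0], gold[3]], [predicted[0], predicted[3]])
print(full)        # precision 2/3, recall 1.0
print(names_only)  # precision 1.0, recall 1.0
```

On the test set restricted to sentences with names, the false positive never shows up, so the model looks perfect; on the full report the same model scores a precision of 2/3.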