I’ve got an external text file which looks like this:
This_ART is_P an_ART example_N.
Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.
Now I want to open this file in Ruby and make an Array with every annotated word. My attempt looks like this:
def get_entries(file)
return File.open(file).map { |x| x.split(/\W+_[A-Z]+/) }
end
but the execution just returns an Array with each sentence as a member:
[["This_ART is_P an_ART example_N.\n"],["Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"]]
The punctuation and the escape characters are included. Where is the mistake or what do I have to change to get the correct array?
Try scanning for just the tokens you want instead of splitting on them, e.g.
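A minimal sketch of that idea, assuming `String#scan` with the pattern `/\w+_[A-Z]+/` (the pattern is inferred from the sample data; the sample text is inlined here instead of being read from a file):

```ruby
# Sketch: collect every word_TAG token from a piece of text.
# Assumption: a token is word characters, an underscore, then an
# upper-case tag, as in the sample sentences from the question.
def get_entries(text)
  text.scan(/\w+_[A-Z]+/)
end

sample = "This_ART is_P an_ART example_N.\n" \
         "Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"

entries = get_entries(sample)
```

To work on the external file, the text could be supplied with `File.read(file)`, which returns the whole file as one string, so the result is a single flat array rather than one array per line.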
That will give you something like:
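For the two sample sentences from the question, a scan with the assumed pattern `/\w+_[A-Z]+/` should yield one flat array without the punctuation or newlines (`text` and `result` are just illustrative names):

```ruby
# Sample text from the question, inlined so the snippet is self-contained.
text = "This_ART is_P an_ART example_N.\n" \
       "Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"

result = text.scan(/\w+_[A-Z]+/)
p result
# ["This_ART", "is_P", "an_ART", "example_N", "Thus_KONJ", "this_ART",
#  "is_P", "a_ART", "part_N", "of_PREP", "it_N"]
```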
If you want the annotation part removed, you could tack on:
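One way to do that, sketched under the assumption that the annotation is always an underscore followed by upper-case letters at the end of the token, is a `map` with `String#sub` (`tagged` and `words` are illustrative names):

```ruby
# Sample text from the question, inlined so the snippet is self-contained.
text = "This_ART is_P an_ART example_N.\n" \
       "Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"

# First collect the annotated tokens, then strip the trailing _TAG part.
tagged = text.scan(/\w+_[A-Z]+/)
words  = tagged.map { |token| token.sub(/_[A-Z]+\z/, "") }
```

Here `words` holds the bare words in their original order, while `tagged` still keeps the annotated form if you need both.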
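Note that \w matches word characters and \W matches non-word characters. Your pattern uses \W+ in front of the underscore, so it never matches the letters of a word, split finds no separator, and each line comes back whole.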