I’m writing some code that iterates over a set of POS tags (generated by pos_tag in NLTK) to search for POS patterns. Matching sets of POS tags are stored in a list for later processing. Surely a regex-style pattern filter already exists for a task like this, but a couple of initial Google searches didn’t turn anything up.
Are there any code snippets out there that can do my POS pattern filtering for me?
Thanks,
Dave
EDIT: Complete solution (using RegexpParser, where message is any string):
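import nltk

text = nltk.word_tokenize(message)
tags = nltk.pos_tag(text)
grammar = r"""
RULE_1: {<JJ>+<NNP>*<NN>*}
"""
chunker = nltk.RegexpParser(grammar)
chunked = chunker.parse(tags)

def is_rule_1(tree):
    # NLTK 3 exposes the chunk name via label(); older releases used tree.node
    return tree.label() == "RULE_1"

# Print every subtree that matched RULE_1
for s in chunked.subtrees(is_rule_1):
    print(s)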
Check out http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html and http://www.regular-expressions.info/reference.html for more on creating the rules.
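If it helps to see what a tag pattern like <JJ>+<NNP>*<NN>* is actually doing, here is a rough stdlib-only sketch of the same idea. It is not NLTK's implementation, just an illustration: the function name find_tag_patterns and the hand-tagged sample sentence are made up for this example, and it only handles literal tags (no wildcards such as <NN.*>).

```python
import re

def find_tag_patterns(tagged, pattern):
    """Return the runs of (word, tag) pairs whose tags match `pattern`.

    `pattern` uses <TAG> notation with regex quantifiers, e.g. "<JJ>+<NN>*".
    Hypothetical helper for illustration; literal tags only.
    """
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>"
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    # Wrap each <TAG> in a non-capturing group so +, * and ? apply to whole tags
    regex = re.sub(r"<[^>]+>", lambda m: "(?:%s)" % re.escape(m.group(0)), pattern)
    matches = []
    for m in re.finditer(regex, tag_string):
        if m.end() == m.start():
            continue  # skip empty matches from fully optional patterns
        # Counting "<" converts character offsets back to token indexes
        start = tag_string.count("<", 0, m.start())
        end = tag_string.count("<", 0, m.end())
        matches.append(tagged[start:end])
    return matches

# Hand-tagged sample input (an assumption, standing in for nltk.pos_tag output)
tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN")]
phrases = find_tag_patterns(tagged, "<JJ>+<NNP>*<NN>*")
for phrase in phrases:
    print(phrase)
```

For real work RegexpParser is the better choice, since it also gives you a tree and supports the richer tag-pattern syntax.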
I think you’re looking for RegexpChunkParser.