Please find an example sentence:
(S1 (S (S (NP (NP (NP (NN Activation)) (PP (IN of) (NP (NN protein) (NN kinase) (NN C)))) (CC and) (NP (NP (NN elevation)) (PP (IN of) (NP (NN cAMP))))) (VP (VBP interact) (ADVP (RB synergistically)) (S (VP (TO to) (VP (VB raise) (NP (NP (NP (NP (NN c-Fos)) (CC and) (NP (NN AP-1))) (NN activity)) (PP (IN in) (NP (NN Jurkat) (NNS cells))))))))) (. .)))
The objective is create a tree from this sentence; the leaves are the words, the intermediate nodes are the Part of speech tags, and the root is S1. The parenthesis indicate the span of the phrases contained within the sentence; they need not be included in the tree.
What would be a good combination of data-structures to achieve the above objective, and would you also be able to share the pseudo-code which supports your suggestion?
I have in mind a HashMap and ArrayList, but am confused how to actually begin with the implementation. Just that the logic isn’t coming intuitively to me at this point. Your suggestions would be appreciated.
Thanks.
I’ve done something similar in Python recently. I’ll address just the part about data structures:
Each part of speech is either a list of parts of speech (a non-terminal), or just a single word (a terminal). So you could create classes something like:
This is relatively untyped: your program needs to construct only valid strutures of these (e.g. no making a VerbPhrase consisting of a list of Sentences — you can do it but it’s meaningless!).
The alternative route is to define a more explicit type system. So you could define classes for each type of part of speech, e.g.
Since there are several ways of making a verb phrase, you could instead have classes for each type:
And so on. The optimal degree of type explicitness at the Java level might not be obvious at the outset. Start by writing something that handles simple sentences, and see how it feels.
In my case I created classes for each part of speech type, though each inherits from either a Terminal or Nonterminal. Then I have rules for how you can construct each type. This got kind of messy for some of them, e.g.
That’s code for saying, “a NounPhrase is a Noun. Or, it’s a RelativeNoun. Or, it’s a Noun followed by an AdjectiveClause. Or, etc.”. There’s a general parser which attempts to apply rules to a list of words until it gets a tree. (You can see the messy and undocumented code at http://code.google.com/p/ejrh/source/browse/trunk/ircbot/nl.py .)
There’s a certain amount of combinatorial explosion here. It could possibly be improved by introducing new types of parts of speech, or just making some parts of it as optional: intead of having a rule for every possible combination of Article/Adjectives/AdjectiveClause/etc. present/absent, you could just make them optional.