I’m writing a learning algorithm for automatic constituent bracketing. Since the algorithm starts from scratch, the bracketing (embedded) should be random at first. It is then improved through iterations. I’m stuck with how to do random bracketing. Can you please suggest a code in R or Python or give some programming idea (pseudo code)? I also need ideas on how to check a random bracketing against a proper one for correctness.
This is what I’m trying to finally arrive at, through the learning process, starting from random bracketing.
This is a sentence.
‘He’ ‘chased’ ‘the’ ‘dog.’
Replacing each element with grammatical elements,
N, V, D, N.
Bracketing (first phase) (D, N are constituents):
(N) (V) (D N)
Bracketing (second phase):
(N) ((V) (D N))
Bracketing (third phase):
((N) ((V) (D N)))
Please help. Thank you.
Here’s all I can say with the information provided:
A naive way for the bracketing would be to generate some trees (generating all can quickly get very space consuming), having as many leaves as there are words (or components), then selecting a suitable one (at random or according to proper partitioning) and apply it as bracketing pattern. For more efficiency, look for a true random tree generation algorithm (I couldn’t find one at the moment).
Additionally, I’d recommend reading about genetic algos/evolutionary programming, especially fitness fucnctions (which are the “check random results for correctness” part). As far as I understood you, you want the program to detect ways of parsing and then keep them in memory as “learned”. That quite matches a genetic algorithm with memorization of “fittest” patterns (and only mutation as changing factor).
An awesome, very elaborate (if working), but probably extremely difficult approach would be to use genetic programming. But that’s probably too different from what you want.
And last, the easiest way to check correctness of bracketing imo would be to keep a table with the grammar/syntax rules and compare with them. You also could improve this to a better fitness function by keeping them in a tree and measuring the distance from the actual pattern (
(V D) N) to the correct pattern (V (D N)). (which is just some random idea, I’ve never actually done this..)