Let’s suppose I have a data set of several hundred thousand strings (which happen to be natural language sentences, if it matters) which are each tagged with a certain “label”. Each sentence is tagged with exactly one label, and there are about 10 labels, each with approximately 10% of the data set belonging to them. There is a high degree of similarity to the structure of sentences within a label.
I know the above sounds like a classical example of a machine learning problem, but I want to ask a slightly different question. Are there any known techniques for programatically generating a set of regular expressions for each label, which can successfully classify the training data while still generalizing to future test data?
I would be very happy with references to the literature; I realize that this will not be a straightforward algorithm 🙂
PS: I know that the normal way to do classification is with machine learning techniques like an SVM or such. I am, however, explicitly looking for a way to generate regular expressions. (I would be happy with with machine learning techniques for generating the regular expressions, just not with machine learning techniques for doing the classification itself!)
So far as I know, this is the subject of current research in evolutionary computation.
Here are some examples:
See slides 40-44 at
https://cs.byu.edu/sites/default/files/Ken_De_Jong_slides.pdf
(slides exist as of the posting of this answer).
Also, see
http://www.citeulike.org/user/bartolialberto/article/10710768
for a more detailed review of a system presented at GECCO 2012.