Let’s suppose I have a data set of several hundred thousand strings (which happen

Question

0

Asked: June 7, 20262026-06-07T17:26:19+00:00 2026-06-07T17:26:19+00:00

Let’s suppose I have a data set of several hundred thousand strings (which happen

0

Let’s suppose I have a data set of several hundred thousand strings (which happen to be natural language sentences, if it matters) which are each tagged with a certain “label”. Each sentence is tagged with exactly one label, and there are about 10 labels, each with approximately 10% of the data set belonging to them. There is a high degree of similarity to the structure of sentences within a label.

I know the above sounds like a classical example of a machine learning problem, but I want to ask a slightly different question. Are there any known techniques for programatically generating a set of regular expressions for each label, which can successfully classify the training data while still generalizing to future test data?

I would be very happy with references to the literature; I realize that this will not be a straightforward algorithm 🙂

PS: I know that the normal way to do classification is with machine learning techniques like an SVM or such. I am, however, explicitly looking for a way to generate regular expressions. (I would be happy with with machine learning techniques for generating the regular expressions, just not with machine learning techniques for doing the classification itself!)