I have about 100,000 strings in a database, and I want to know if there is a way to automatically generate a regex pattern from these strings. All of them are alphabetic strings that use a subset of the English letters; X, W, and V are not used, for example. Is there any function or library that can help me achieve this in C#? Example strings are:
KHTK
RAZ
Given these two strings, my goal is to generate a regex that allows patterns like (k, kh, kht, khtk, r, ra, raz), case-insensitive of course. I have downloaded and tried some C# applications that help generate regexes, but they are not useful in my scenario: I want a process in which I read strings sequentially from the db and add rules to the regex, so that the regex can be reused later in the application or saved to disk.
I’m new to regex patterns and don’t know whether what I’m asking is even possible. If it is not, please suggest an alternative approach.
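A simple (some might say naive) approach would be to create a regex pattern that concatenates all the search strings, separated by the alternation operator |:
KHTK|RAZ
To also match the prefixes, include each prefix as its own alternative:
K|KH|KHT|KHTK|R|RA|RAZ
Finally, to match only whole strings rather than substrings of longer input, anchor each alternative with ^ and $:
^K$|^KH$|^KHT$|^KHTK$|^R$|^RA$|^RAZ$
We would expect the Regex class implementation to do the heavy lifting of converting the long regex pattern string to an efficient matcher.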
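As a rough illustration of the construction described above, here is a minimal sketch that builds the anchored prefix pattern from a list of strings. The class and method names (PrefixRegexBuilder, AllPrefixes, BuildPattern, BuildRegex) are made up for this example, and it assumes the inputs are plain letters as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class PrefixRegexBuilder
{
    // Collects every non-empty prefix of every input string,
    // upper-cased, de-duplicated and sorted in ordinal order.
    public static SortedSet<string> AllPrefixes(IEnumerable<string> words)
    {
        var prefixes = new SortedSet<string>(StringComparer.Ordinal);
        foreach (var word in words)
        {
            var upper = word.ToUpperInvariant();
            for (int len = 1; len <= upper.Length; len++)
                prefixes.Add(upper.Substring(0, len));
        }
        return prefixes;
    }

    // Joins the prefixes as ^P1$|^P2$|...
    // Regex.Escape leaves plain letters unchanged; it is only a safeguard.
    public static string BuildPattern(IEnumerable<string> words)
    {
        return string.Join("|",
            AllPrefixes(words).Select(p => "^" + Regex.Escape(p) + "$"));
    }

    // Compiles the pattern with case-insensitive matching.
    public static Regex BuildRegex(IEnumerable<string> words)
    {
        return new Regex(BuildPattern(words),
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }
}
```

For the two example strings, BuildPattern(new[] { "KHTK", "RAZ" }) yields ^K$|^KH$|^KHT$|^KHTK$|^R$|^RA$|^RAZ$. Since the pattern is an ordinary string, it can be saved with File.WriteAllText and passed to the Regex constructor again later. Note that in .NET, $ also matches just before a trailing newline; \z is the stricter end-of-input anchor if that matters for your input.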
The sample program here generates 10,000 random strings, and a regular expression that matches exactly those strings and all their prefixes. The program then verifies that the regex indeed matches just those strings, and times how long it all takes.
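For readers who cannot reach the linked program, the following is a rough sketch of the kind of check it describes, not the original code. The names (PrefixRegexCheck, RandomWords, Verify), the alphabet constant, and the use of X as a letter that never appears in the data are assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical test harness, not the answer's original program.
public static class PrefixRegexCheck
{
    // Assumed alphabet: English letters without V, W and X.
    const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUYZ";

    // Generates 'count' distinct random words of length 1..maxLen.
    // The caller must ask for fewer words than the alphabet allows.
    public static HashSet<string> RandomWords(int count, int maxLen, int seed)
    {
        var rng = new Random(seed);
        var words = new HashSet<string>();
        while (words.Count < count)
        {
            var chars = new char[rng.Next(1, maxLen + 1)];
            for (int i = 0; i < chars.Length; i++)
                chars[i] = Alphabet[rng.Next(Alphabet.Length)];
            words.Add(new string(chars));
        }
        return words;
    }

    // Builds the anchored prefix regex, then checks that every prefix
    // matches (in both cases) and that strings containing the unused
    // letter X do not. Reports the total elapsed time.
    public static bool Verify(ICollection<string> words, out TimeSpan elapsed)
    {
        var timer = Stopwatch.StartNew();
        var prefixes = new HashSet<string>();
        foreach (var w in words)
            for (int len = 1; len <= w.Length; len++)
                prefixes.Add(w.Substring(0, len));

        var pattern = string.Join("|", prefixes.Select(p => "^" + p + "$"));
        var regex = new Regex(pattern,
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

        bool ok = true;
        foreach (var p in prefixes)
        {
            if (!regex.IsMatch(p) || !regex.IsMatch(p.ToLowerInvariant())) ok = false;
            if (regex.IsMatch(p + "X") || regex.IsMatch("X" + p)) ok = false;
        }
        elapsed = timer.Elapsed;
        return ok;
    }
}
```

Calling Verify(RandomWords(10000, 8, 42), out var elapsed) would exercise a set of the size discussed here; the word length and seed are arbitrary choices, and timings will depend on the machine and .NET version.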
On an Intel Core2 box I’m getting the following numbers for 10,000 strings:
When increasing the number of strings 10-fold (to 100,000), I’m getting:
This is higher, but the growth is less than linear.
The app’s memory consumption (at 10,000 strings) started at ~9MB, peaked at ~23MB (which must have included both the regex and the string set), and dropped to ~16MB towards the end (perhaps garbage collection kicked in?). Draw your own conclusions from that; the program is not designed to separate the regex’s memory consumption from that of the other data structures.