I have a list of about 1500 strings from an external database. Over time, as a group of business users managed them, the strings came to contain recurring substrings that carry semantic value.
I’m building a front-end and would like to present the user with a drop-down list of those substrings to filter by.
For example, if I have the input strings:
- US foo
- US bar (Inactive)
- UK bat
- UK baz (Inactive)
- AU womp
- AU rat
I want to get back:
- US
- UK
- AU
- Inactive
My first thought is to have a threshold parameter and a list of delimiters. For the above I might say threshold=.3 and the delimiters are space, (, and ).
Then do a string.split using the delimiters and use a data structure like a set that counts repeated items (?)…
I am not trying to have someone do my work for me here – advice on the approach to take from someone who has done this would be great.
This problem is a good candidate for a LINQ approach:
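A minimal sketch of what such a query might look like, using the threshold and delimiters from the question. It assumes the threshold means "the fraction of input strings that contain the token"; the names `SubstringFilter` and `FindCommonTokens` are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SubstringFilter
{
    // Hypothetical helper: returns the tokens that occur in at least
    // `threshold` (a fraction between 0 and 1) of the input strings.
    public static List<string> FindCommonTokens(
        IReadOnlyCollection<string> inputs, double threshold, char[] delimiters)
    {
        return inputs
            // Split each string, then de-duplicate within it so that a token
            // repeated inside a single string is counted only once.
            .SelectMany(s => s
                .Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
                .Distinct())
            // Group identical tokens across all strings.
            .GroupBy(token => token)
            // Keep tokens whose share of the input strings meets the threshold.
            .Where(g => (double)g.Count() / inputs.Count >= threshold)
            .Select(g => g.Key)
            .ToList();
    }

    public static void Main()
    {
        var inputs = new[]
        {
            "US foo", "US bar (Inactive)", "UK bat",
            "UK baz (Inactive)", "AU womp", "AU rat"
        };

        var tokens = FindCommonTokens(inputs, 0.3, new[] { ' ', '(', ')' });

        // Prints: US, Inactive, UK, AU
        Console.WriteLine(string.Join(", ", tokens));
    }
}
```

Each of the four tokens appears in 2 of the 6 strings (about 0.33), so it passes a threshold of 0.3, while `foo`, `bar`, and the rest appear only once and are dropped. Comparison here is case-sensitive; passing `StringComparer.OrdinalIgnoreCase` to `Distinct` and `GroupBy` would be one way to relax that if the data is inconsistently cased.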