Given two or more samples of text, specifically segments of code, what’s the most efficient way of detecting where the samples differ and forming a pattern that matches each sample?
For example, given the following samples of code:
cd ~/workspaces/project/tmp1/bin
rsync --recursive --progress /data/local/documents* data
cd ~/workspaces/project/we32usZ/bin
rsync --recursive --progress /data/local/lib* data
cd ~/workspaces/project/oiususs/bin
rsync --recursive --progress /data/local/usr* data
How would I deduce this pattern (where $varN indicates a wildcard variable)?
cd ~/workspaces/project/$var1/bin
rsync --recursive --progress /data/local/$var2* data
My initial approach is to compare two samples character by character until a difference is found, then search for where the “variable” part of the text ends, and repeat this for the other samples. However, this seems very inefficient, and it assumes the texts are very similar to begin with. Is there a better way?
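For something like the example you mentioned, some variation of multiple sequence alignment would help. You are essentially looking for the substrings that are conserved across all of your code samples, which can be found with dynamic programming. The conserved regions become the literal parts of the pattern, and the gaps between them become the $varN wildcards.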
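To make the idea concrete, here is a rough Python sketch. It is not a full multiple sequence alignment: it folds the samples into a pattern one at a time (a progressive pairwise alignment), and it uses the standard library's difflib.SequenceMatcher for the alignment step rather than a hand-written dynamic-programming table. The names WILD, merge, infer_pattern, render and to_regex are made up for this illustration.

```python
import re
from difflib import SequenceMatcher

WILD = object()  # sentinel for a wildcard slot; never equal to any character


def merge(pattern, sample):
    """Align the current pattern with one more sample and keep what matches."""
    a, b = list(pattern), list(sample)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    out, a_end, b_end = [], 0, 0
    for i, j, n in matcher.get_matching_blocks():
        # Anything skipped on either side before this block is variable text.
        if i > a_end or j > b_end:
            out.append(WILD)
        out.extend(a[i:i + n])  # the conserved run itself
        a_end, b_end = i + n, j + n
    return out


def infer_pattern(samples):
    """Start from the first sample and fold the remaining samples in."""
    pattern = list(samples[0])
    for sample in samples[1:]:
        pattern = merge(pattern, sample)
    return pattern


def render(pattern):
    """Show the pattern with $var1, $var2, ... in place of the wildcards."""
    parts, count = [], 0
    for token in pattern:
        if token is WILD:
            count += 1
            parts.append(f"$var{count}")
        else:
            parts.append(token)
    return "".join(parts)


def to_regex(pattern):
    """Compile the pattern to a regex with one capture group per wildcard."""
    body = "".join("(.*?)" if t is WILD else re.escape(t) for t in pattern)
    return re.compile(body, re.DOTALL)


samples = [
    "cd ~/workspaces/project/tmp1/bin\n"
    "rsync --recursive --progress /data/local/documents* data",
    "cd ~/workspaces/project/we32usZ/bin\n"
    "rsync --recursive --progress /data/local/lib* data",
    "cd ~/workspaces/project/oiususs/bin\n"
    "rsync --recursive --progress /data/local/usr* data",
]

pattern = infer_pattern(samples)
print(render(pattern))
```

On the three samples from the question this prints the two lines of the pattern you were after, with $var1 in the directory name and $var2 before the asterisk, and to_regex(pattern).fullmatch(sample) matches each of the originals. One caveat: because the alignment is done per character, coincidental overlaps inside the variable parts get treated as conserved text (with only the second and third samples, the shared "us" in we32usZ and oiususs would split the first wildcard in two). Aligning on tokens such as path components or words instead of characters would likely be more robust.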