I have a large protein sequence, approximately 5000 characters long, which I put in a text file (p_sqn.txt). I also have a shorter query sequence,
for example: SDJGSKLDJGSNMMUWEURYI
I have to compute a percentage identity score, so I need to find the region of the protein sequence (protein_sequence.txt) that is most similar to the query.
I would start by checking the Levenshtein distance at every offset in the sequence, i.e. comparing the query against each query-length window of the protein.
With a length of just 5000, a full pass won't take very long (milliseconds).
Fortunately, the Apache commons-lang library provides the
StringUtils.getLevenshteinDistance() utility method. With this, the code would be just a few lines. FYI, a score of zero means an exact match was found.
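Here is a minimal sketch of that sliding-window pass. To keep it self-contained, it uses a small hand-written levenshtein() helper; with commons-lang on the classpath, StringUtils.getLevenshteinDistance(window, query) could be called in its place. The class and method names are made up for illustration, and percentIdentity() is only one rough way to turn a distance into a percentage (an assumption, not a standard definition).

```java
class BestWindowFinder {

    /** Two-row dynamic-programming Levenshtein distance (stand-in for the commons-lang method). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Slides a query-length window over the protein and returns
     * {bestOffset, bestDistance}. The offset is -1 if the protein
     * is shorter than the query.
     */
    static int[] findBestWindow(String protein, String query) {
        int bestOffset = -1;
        int bestDistance = Integer.MAX_VALUE;
        for (int i = 0; i + query.length() <= protein.length(); i++) {
            String window = protein.substring(i, i + query.length());
            int d = levenshtein(window, query);
            if (d < bestDistance) { // keep the first window with the lowest distance
                bestDistance = d;
                bestOffset = i;
            }
        }
        return new int[] {bestOffset, bestDistance};
    }

    /** Assumed convention: share of query positions not consumed by an edit. */
    static double percentIdentity(int distance, int queryLength) {
        return 100.0 * (queryLength - distance) / queryLength;
    }
}
```

Calling findBestWindow(protein, "SDJGSKLDJGSNMMUWEURYI") gives the offset of the closest window and its distance; a distance of 0 means the query occurs verbatim at that offset.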
To make it easy to read the file in, you can use the Apache commons-io library utility method
FileUtils.readFileToString().
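A minimal sketch of the file-reading step follows. To stay dependency-free it uses Files.readString from the JDK (Java 11+) as a stand-in; with commons-io available, FileUtils.readFileToString(new File(path), StandardCharsets.UTF_8) should serve the same purpose. Stripping whitespace is an assumption, made in case the sequence is wrapped over several lines in the file; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SequenceFileReader {

    /** Reads the whole file into one string and drops line breaks and other whitespace. */
    static String readSequence(Path file) throws IOException {
        String raw = Files.readString(file, StandardCharsets.UTF_8);
        return raw.replaceAll("\\s+", ""); // assumption: whitespace is not part of the sequence
    }
}
```

For example, readSequence(Path.of("p_sqn.txt")) would return the protein as a single string ready to be scanned window by window.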