I have a large protein sequence which is approximately 5000 so I put it


Asked: June 14, 2026 (2026-06-14T21:40:48+00:00)


I have a large protein sequence, approximately 5000 characters long, so I put it in a text file (p_sqn.txt). I also have a shorter query sequence,

for example: SDJGSKLDJGSNMMUWEURYI

I need to write a percentage identity scoring function, and for that I have to find the part of the protein sequence (protein_sequence.txt) that is most similar to the query sequence.


1 Answer

Editorial Team · Answer 1 · 2026-06-14T21:40:49+00:00

I would start by computing the Levenshtein distance between your search string and the window of the same length at every position in the sequence.

With a length of just 5000, the whole pass won’t take very long (milliseconds).

Fortunately, the Apache Commons Lang library provides the StringUtils.getLevenshteinDistance() utility method. With this, the code is just a few lines:

import org.apache.commons.lang.StringUtils;

String protein; // the full sequence
String part; // your search string
int bestScore = Integer.MAX_VALUE;
int bestLocation = 0;
String bestSequence = "";
// <= so that the final window, which ends at the last character, is also checked
for (int i = 0; i <= protein.length() - part.length(); i++) {
    // substring takes (beginIndex, endIndex), so the window must end at i + part.length()
    String sequence = protein.substring(i, i + part.length());
    int score = StringUtils.getLevenshteinDistance(sequence, part);
    if (score < bestScore) {
        bestScore = score;
        bestLocation = i;
        bestSequence = sequence;
    }
}

// at this point in the code, the "best" variables will have data about the best match.

FYI, a score of zero means an exact match was found.
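
If you would rather not add a library dependency, the same sliding-window search can be sketched in plain Java. This is only a sketch under some assumptions: the class name BestWindow and its methods are made up for illustration, and percentIdentity() is one possible way to turn the edit distance into a percentage (the share of the query length not consumed by edits), which may not be the definition your assignment expects.

```java
class BestWindow {

    /** Levenshtein distance using two rows of the dynamic-programming table. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns {start index, distance} of the first window with the lowest
     * distance, or {-1, Integer.MAX_VALUE} if protein is shorter than part.
     */
    static int[] bestWindow(String protein, String part) {
        int bestLocation = -1;
        int bestScore = Integer.MAX_VALUE;
        for (int i = 0; i + part.length() <= protein.length(); i++) {
            String window = protein.substring(i, i + part.length());
            int score = levenshtein(window, part);
            if (score < bestScore) {
                bestScore = score;
                bestLocation = i;
            }
        }
        return new int[] {bestLocation, bestScore};
    }

    /** Assumed definition: percentage of the query length not consumed by edits. */
    static double percentIdentity(int distance, int queryLength) {
        return 100.0 * (queryLength - distance) / queryLength;
    }

    public static void main(String[] args) {
        String protein = "AAAASDJGSKLDJGSNMMUWEURYIAAAA"; // toy data, not a real protein
        String part = "SDJGSKLDJGSNMMUWEURYI";
        int[] best = bestWindow(protein, part);
        System.out.println("location=" + best[0] + " distance=" + best[1]
                + " identity=" + percentIdentity(best[1], part.length()) + "%");
    }
}
```

Because every window has the same length as the query, the distance can never exceed the query length, so the percentage stays between 0 and 100.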

To make it easy to read the file in, you can use the Apache Commons IO library utility method FileUtils.readFileToString(), like this:

import java.io.File;
import org.apache.commons.io.FileUtils;

String protein = FileUtils.readFileToString(new File("/some/path/to/myproteinfile.txt"));
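
One thing to watch for: if the text file contains line breaks or spaces, those characters end up inside the string and will count as mismatches. Below is a minimal sketch of reading the file with only the JDK and stripping whitespace first. The class name SequenceLoader and its methods are made up for illustration, and it assumes the file holds nothing but the sequence itself (no header lines) in UTF-8.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class SequenceLoader {

    /** Removes spaces, tabs and line breaks so only sequence characters remain. */
    static String clean(String raw) {
        return raw.replaceAll("\\s+", "");
    }

    /** Reads the whole file as UTF-8 text and returns the cleaned sequence. */
    static String load(String path) {
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(path));
            return clean(new String(bytes, StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Rethrown unchecked so callers do not need their own try/catch
            throw new UncheckedIOException(e);
        }
    }
}
```

With this, String protein = SequenceLoader.load("/some/path/to/myproteinfile.txt"); would give you a string that is safe to slide the search window over.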
