I would like to calculate the number of pairwise differences between a long list of sequences and store the results in a matrix.
I have a few hundred genetic sequences; they are already aligned and all have the same length (about 300 characters). I’m not looking for one of the edit distance algorithms (Hamming, Levenshtein, etc.) but instead would like to get the absolute number of differences between two sequences, comparing them at each character position.
For example,
Sequence 1: "GAT-ACA"
Sequence 2: "AT-GCGA"
Number of differences: 6
(The dash is a gap character that allows the sequences to be aligned; my sequences may also include dashes.)
Is there an efficient way to do this in Python (or another language) with a short computing time? I also asked this question for R, initially intending to do it that way, but it turned out to be too slow to apply to several hundred sequences.
Thank you!
If you want to calculate the matrix that displays the differences between the pairs, you can do it like this:
Result:
The resulting matrix (strictly speaking, an array) shows the number of differences between each pair of sequences.
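For reference, here is a minimal sketch of one way to build such a matrix, assuming NumPy is available and that the sequences contain only ASCII characters. The function name `pairwise_differences` and the sample list `seqs` are made up for illustration. For equal-length strings, the per-position mismatch count computed here coincides with what is usually called the Hamming distance.

```python
import numpy as np

def pairwise_differences(seqs):
    """Return an n x n array of per-position mismatch counts (hypothetical helper)."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all sequences must have the same length")
    # Encode every sequence as a row of byte values: shape (n, length).
    arr = np.array([list(s.encode("ascii")) for s in seqs], dtype=np.uint8)
    n = arr.shape[0]
    dist = np.zeros((n, n), dtype=np.int64)
    for i in range(n):
        # Broadcast row i against all rows; True marks a mismatching position.
        dist[i] = (arr[i] != arr).sum(axis=1)
    return dist

# Sample data: the two sequences from the question plus a duplicate of the first.
seqs = ["GAT-ACA", "AT-GCGA", "GAT-ACA"]
matrix = pairwise_differences(seqs)
print(matrix)
```

Dashes are treated like any other character, so a gap aligned against a base counts as a difference. The loop runs once per sequence and each comparison is vectorised, so a few hundred sequences of about 300 characters should be well within reach, although I have not timed it on real data.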