I’m trying to extract sequences from a database using the following code:
use strict;
use Bio::SearchIO;
use Bio::DB::Fasta;
my ($file, $id, $start, $end) = ("secondround_merged_expanded.fasta","C7136661:0-107",1,10);
my $db = Bio::DB::Fasta->new($file);
my $seq = $db->seq($id, $start, $end);
print $seq,"\n";
Where the header of the sequence I’m trying to extract is: C7136661:0-107, as in the file:
>C7047455:0-100
TATAATGCGAATATCGACATTCATTTGAACTGTTAAATCGGTAACATAAGCAGCACACCTGGGCAGATAGTAAAGGCATATGATAATAAGCTGGGGGCTA
The code works fine when I switch the header to something more standard (like test). I’m thinking that BioPerl doesn’t like the non-standard heading. Any way to fix this so I don’t have to recode the FASTA file?
By default,
Bio::DB::Fastawill use all non-space characters immediately following the>on the header line to form the key for the sequence. In your case this looks likeC7047455:0-100, which is the same as the built-in abbreviation for a subsequence. As documented here, instead of$db->seq($id, $start, $stop)you can use$db->seq("$id:$start-$stop"), so a call to$db->seq('C7136661:0-107')looks like you are asking for$db->seq('C7136661', 0, 107), and that key doesn’t exist.I have no way of knowing what is in your data, but if it is adequate to use just the first part of the header up to the colon as a key then you can use the
-makeidcallback to modify the key. Then if you use justC7136661to retrieve the sequence it will work.This code demonstrates. Note that you will probably already have a
.indexcache file that you must delete before you see any change in behaviour.