I have written the following script to search for a motif (substring) in protein sequences (strings). I am a beginner, and writing this has been tough for me. I have two questions about it:
1. Errors: The script has a few errors. I have been at it for quite some time now but have not figured out what they are or why they occur.
2. The script searches for one motif (substring) in protein sequences (strings). My next task is to search for multiple motifs in a specific order (e.g. motif1 motif2 motif3 motif4; this order cannot be changed) in the same protein sequences (strings).
use strict;
use warnings;
my @file_data=();
my $motif ='';
my $protein_seq='';
my $h= '[VLIM]';
my $s= '[AG]';
my $x= '[ARNDCEQGHILKMFPSTWYV]';
my $regexp = "($h){4}D($x){4}D"; #motif to be searched is hhhhDxxxxD
my @locations=();
@file_data= get_file_data("seq.txt");
$protein_seq= extract_sequence(@file_data);
#searching for a motif hhhhDxxxxD in each protein sequence in the give file
foreach my $line(@file_data){
if ($motif=~ /$regexp/){
print "found motif \n\n";
}
else {
print "not found \n\n";
}
}
#recording the location/position of motif to be outputed
@locations= match_position($regexp,$seq);
if (@locations){
print "Searching for motifs $regexp \n";
print "Catalytic site is at location:\n";
}
else{
print "motif not found \n\n";
}
exit;
sub get_file_data{
my ($filename)=@_;
use strict;
use warnings;
my $sequence='';
foreach my $line(@file_data){
if ($line=~ /^\s*$/){
next;
}
elsif ($line=~ /^\s*#/){
next;
}
elsif ($line=~ /^>/){
next;
}
else {
$sequence.=$line;
}
}
$sequence=~ s/\s//g;
return $sequence;
}
sub(match_positions) {
my ($regexp, $sequence)=@_;
use strict;
my @position=();
while ($sequence=~ /$regexp/ig){
push (@position, $-[0]);
}
return @position;
}
First of all, the keyword is elsif; second of all, you don’t need it. You can compress the chain of tests in the get_file_data loop into a single pattern match. As long as you’re going to use regular expressions, and unless the pattern gets too unwieldy, you might as well search in one alternation for all the cases that you want to ignore. If you find that you actually need the second case, you can add it as another alternation. Say you wanted to exclude lines that begin with #-. Then you would just add it in like so: /^\s*$|^>|^#-/
Another thing is that my position=(); needs to have the @ sigil before position; otherwise, perl thinks you’re trying to do something tricky with a call to position().
You also need to quote the character classes, for example my $h = '[VLIM]';. Otherwise, you’re just assigning to $h an array reference with a single slot populated by whatever would be returned from the sub VLIM.
Third, don’t use $&. Replace pos($sequence)-length($&)+1 with $-[0]+1, which is the same position taken from the match-start array @-, or better yet, use English and write $LAST_MATCH_START[0]+1.
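Put together, the points above could look roughly like the following. This is only a sketch under a few assumptions: the skip pattern keeps the comment case (^\s*#) from the question's code, the sub names extract_sequence and match_positions are borrowed from the question, and the sample lines are made up for illustration. Positions are reported 1-based, as pos($sequence)-length($&)+1 would give.

```perl
use strict;
use warnings;

# Character classes are quoted strings so they interpolate into the pattern.
my $h = '[VLIM]';
my $x = '[ARNDCEQGHILKMFPSTWYV]';
my $regexp = "(?:$h){4}D(?:$x){4}D";    # hhhhDxxxxD

# Join all sequence lines, skipping blank, comment, and FASTA header lines.
sub extract_sequence {
    my @lines = @_;
    my $sequence = '';
    for my $line (@lines) {
        next if $line =~ /^\s*$|^\s*#|^>/;    # one alternation, no elsif chain
        $sequence .= $line;
    }
    $sequence =~ s/\s//g;
    return $sequence;
}

# Return the 1-based start position of every match, without using $&.
sub match_positions {
    my ($pattern, $sequence) = @_;
    my @positions;
    while ($sequence =~ /$pattern/g) {
        push @positions, $-[0] + 1;    # $-[0] is the 0-based start offset
    }
    return @positions;
}

# Made-up sample input standing in for the lines of seq.txt.
my @sample_lines = (">seq1\n", "AAVLIMDAAAAD\n", "\n", "KK\n");
my $protein_seq  = extract_sequence(@sample_lines);
my @locations    = match_positions($regexp, $protein_seq);

if (@locations) {
    print "Motif found at position(s): @locations\n";
}
else {
    print "Motif not found\n";
}
```

The non-capturing groups (?:...) keep {4} attached to the whole character class; writing $h{4} directly inside a double-quoted string would be read as a hash lookup.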
I would suggest the following for the file reading:
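One way the file reading could be written is sketched here. It is a minimal example, assuming a three-argument open with a lexical filehandle is acceptable; the sub name get_file_data comes from the question, while the temporary file is purely hypothetical and only stands in for seq.txt so the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read every line of the named file into a list.
sub get_file_data {
    my ($filename) = @_;
    open my $fh, '<', $filename
        or die "Cannot open '$filename': $!";
    my @lines = <$fh>;
    close $fh;
    return @lines;
}

# Hypothetical stand-in for seq.txt: a small temporary FASTA-style file.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} ">seq1\nAAVLIMDAAAAD\n";
close $tmp_fh;

my @file_data = get_file_data($tmp_name);
print scalar(@file_data), " lines read\n";
```

Dying with $! in the message reports why the open failed, such as a missing file or a permissions problem.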
A suggestion for going forward, one that helps me immensely:
A. Install Smart::Comments.
B. Put use Smart::Comments; at the top of your script.
C. Every time you’re not sure what you’ve got so far, for example if you want to see the current contents of $sequence, place a smart comment for it in the code, such as ### $sequence followed by exit;, to just show it and exit. When you get too many printouts, delete them.