I am an absolute newbie to Perl as well as to programming in general (less than a month’s experience).
I am stumped with a problem which needs to be resolved if I am to solve a bigger issue.
Basically, I have 2 arrays which look like this:
@array1 = ('NM_1234' , '1452' , 'NM_345' , '5008' , 'NR_6145' , '256');
@array2 = ('NM_5673' , '2' , 'NM_345' , '5' , 'NR_6145' , '10');
@array1 contains id numbers followed by length. The id number is of nucleotide sequences and length is the length of the sequence.
@array2 contains id numbers followed by the number of G-Quadruplex structures in each, so some sequences contain only 2 such structures while others contain 10 or more.
The basic problem is that I need to add to @array2 the “length numbers” from @array1 (e.g. 5008, 256) for every matching id number.
For example, as NM_345 appears in both arrays, I need to add 5008 to it, so that the final result becomes NM_345,5,5008.
The same goes for NR_6145 and other such matches (there are over 20,000 id numbers in @array2).
So far, I have been able to write code which can just search for the same id number in both the arrays. Here is the code:
#Enter file name
print "Enter file name: ";
$in =<>;
chomp $in;
open(FASTA,"$in") or die;
@data = <FASTA>; #Read in data
$data = join ('',@data); #Convert to string
@data2 = split('\n',$data); #Explode along newlines
#Enter 2nd file name
print "\n\nEnter 2nd file name: ";
$in2=<>;
chomp $in2;
open(FASTA,"$in2") or die;
@entry =<FASTA>; #Read in data
$entry = join('',@entry); #Convert to string
@entry2 = split('\n',$entry); #Explode along newlines
my %seen;
for $item (@data2) {
    if ($item =~ /([0-9]+)/) {
        push @{$seen{$key}}, $item;#WHAT IS THIS DOING? HOW?
    }
}
for my $item (@entry2) {
    if ($item =~ /([0-9]+)/) {
        if (exists $seen{$key}) {
            print $item, "\n";
        }
    }
}
exit;
I derived the code which finds the same element from 2 arrays from this solution here, so full credit goes to Chas.Owens: https://stackoverflow.com/a/1064929/1468737.
And of course, I do not quite yet understand this part:
push @{$seen{$key}}, $item;#WHAT IS THIS DOING? HOW?
It appears to be an array of a hash value or something?
So, how do I now add the length element from @array1 into @array2? I think I need to use the splice command, but how?
My desired output should look like this:
NM_345,5,5008
NR_6145,10,256
etc.
I also need to save this output into a file which will then later be analyzed to see if there is any correlation between length and G-quadruplex number.
Any help or input will be deeply appreciated.
Thank you for taking the time to go through my problem!
EDIT: This edit shows what the data files look like. They are basically output files from other programs I wrote.
My first file, named Transcriptlength.fa, with over 40,000 id numbers going into @array1, looks like this:
NR_037701
3353
NM_198399
2414
NR_026816
601
NR_027917
658
NR_002777
1278
My second file, named Quadcount.AllGtranscripts.fa, with over 20,000 id numbers going into @array2, looks like this:
NM_000014
1
NM_000016
3
NM_000017
19
NM_000018
2
NM_000019
3
NM_000020
30
NM_000021
1
NM_000022
2
NM_000023
5
NM_000024
1
NM_000025
15
NM_000029
5
It looks as though you are having trouble reading the data files as well as generating the output you want. We cannot help with that part of the problem unless you show us an example of the file data, but here is a solution for producing the output correctly.
It is best if your data is stored in hashes as that allows direct access to the length and structure count for a given sequence ID. Fortunately, arrays laid out as you have described them can easily be converted to hashes by a simple assignment, so this short program does what you want from the arrays you show.
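The program itself is not reproduced here, so the following is only a minimal sketch of the approach just described, using the sample arrays from the question. The names %lengths, %counts and @lines are illustrative choices, not necessarily those of the original program.

```perl
use strict;
use warnings;

my @array1 = ('NM_1234', '1452', 'NM_345', '5008', 'NR_6145', '256');
my @array2 = ('NM_5673', '2', 'NM_345', '5', 'NR_6145', '10');

# Assigning a flat (id, value, id, value, ...) list to a hash
# turns it into id => value pairs.
my %lengths = @array1;
my %counts  = @array2;

my @lines;
# grep /\D/ keeps only the IDs, in their original order
for my $id (grep /\D/, @array2) {
    next unless exists $lengths{$id};    # skip IDs with no known length
    push @lines, join(',', $id, $counts{$id}, $lengths{$id});
}

print "$_\n" for @lines;
```

With the arrays shown, this prints NM_345,5,5008 and NR_6145,10,256, each on its own line.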
The grep /\D/, @array2 list in the loop just selects all the sequence IDs from @array2 by picking only those elements that contain a non-digit character. I have done it this way in case the order in which the sequences are displayed matters. In your final program you should probably process the data directly from the file instead of reading it into an array, so this won’t be an issue.
Update
Your file data is ideal for paragraph mode, where records are separated by blank lines in the data file. To enable it, set the input record separator variable $/ to an empty string "".
This revised program reads records from the first file, splits them on whitespace (which includes space, tab and newline, amongst others) and builds a hash %lengths which relates each sequence ID to its length. The same is done with the second file, this time checking whether the sequence ID appears in the hash. If so, the complete record is output.
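As a rough sketch of the paragraph-mode approach described above, and not the original program: the helper merge_records is a hypothetical name, the sample records and their values are made up so that some IDs match, and the data is read from in-memory strings so the example is self-contained. It assumes each record is an ID line followed by a number line, with a blank line between records.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the two data files; records are separated
# by blank lines, which is what paragraph mode relies on.
my $length_data = "NM_000014\n4678\n\nNR_037701\n3353\n\nNM_000017\n1859\n";
my $count_data  = "NM_000014\n1\n\nNM_000016\n3\n\nNM_000017\n19\n";

my @merged = merge_records(\$length_data, \$count_data);
print "$_\n" for @merged;

# Takes two sources (file names, or references to strings holding the data)
# and returns "id,count,length" lines for IDs present in both.
sub merge_records {
    my ($length_src, $count_src) = @_;
    local $/ = "";    # paragraph mode: one blank-line-separated record per read

    open my $len_fh, '<', $length_src or die "Cannot open length data: $!";
    my %lengths;
    while (my $record = <$len_fh>) {
        my ($id, $length) = split ' ', $record;    # split on any whitespace
        $lengths{$id} = $length if defined $length;
    }
    close $len_fh;

    open my $cnt_fh, '<', $count_src or die "Cannot open count data: $!";
    my @out;
    while (my $record = <$cnt_fh>) {
        my ($id, $count) = split ' ', $record;
        next unless defined $id && exists $lengths{$id};
        push @out, join(',', $id, $count, $lengths{$id});
    }
    close $cnt_fh;

    return @out;
}
```

Passing file names such as 'Transcriptlength.fa' and 'Quadcount.AllGtranscripts.fa' instead of the string references should make the same code read from disk, and redirecting the printed lines to a file would save the result for later analysis.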
Unfortunately, the sample data that you have chosen doesn’t contain any matching sequence IDs, so there is no output from this program when run against that data. Your actual files will be more productive.