I am stuck with a rather unique problem. I have 2 files which I am reading. A small sample of each file looks like the following:
File1
chr1 9873 12227 11873 2354 + NR_046018 DDX11L1
chr1 760970 763155 762970 2185 + NR_047520 LOC643837
File2
chr1 9871 0
chr1 9872 1
chr1 9873 1
chr1 9874 2
chr1 9875 1
chr1 9876 3
chr1 9877 3
chr1 760970 1
chr1 760971 1
chr1 760972 1
chr1 760973 2
chr1 760974 3
chr1 760975 3
chr1 760976 4
chr1 760977 5
chr1 760978 6
chr1 760979 7
chr1 760980 6
chr1 760981 7
chr1 760982 8
chr1 760983 9
chr1 760984 10
chr1 760985 11
chr1 760986 12
chr1 760987 10
chr1 760988 9
chr1 760989 6
Problem
- From the 1st file, I have to pick up the 2nd element from each row and take it as $start. An ending position is determined by $end = $start + 10.
- Based on $start, I now have to take the 2nd file and look at the 2nd element of each row. Once $start is found, I need to sum the corresponding values of the 3rd element in groups of 5, up to $end.
So as $end is $start + 10 and I am summing in groups of 5, 2 summation values would be obtained.
In case some values up to $end are not present in the 2nd element of the 2nd file, the code should not stop; it should continue to perform the summation and display the sum as 0 (in case a continuous group of 5 elements is not present).
Taking the example of the files here, from File1, 2nd element = 9873, which is assigned to $start. Thus $end would be $start + 10, i.e. 9883.
From File2, once $start is found in the 2nd element of the row, the 3rd element for the next 5 rows have to be summed as 1 group, and the next 5 values summed as 2nd group till $end.
Note
As can be seen in File2, the positions after 9877 up to $end (i.e. 9883) are not present. Hence the sum for the second group (9878 to 9882) must be zero. It must not sum the values from 760970 onwards.
Desired Output
chr1 9873 12227 11873 2354 + NR_046018 DDX11L1 10 0
chr1 760970 763155 762970 2185 + NR_047520 LOC643837 8 25
Points to Note
- While dealing with the actual files, $end = $start + 10,000 (instead of $end = $start + 10).
- Likewise, groups of 25 values will be summed (instead of 5), giving 400 sums in total while working with the actual files.
- In case a range of values is not present in the 2nd element of $file2, summation should proceed as normal; if a continuous group of 25 values is absent, 0 should be printed out.
- The files contain > 1 million rows each.
Code
The code I’ve written so far manages to do the following :
- Read from files.
- Assign $start and $end from file1.
- From file2, push all 2nd elements into array @c_posn; all 3rd elements into array @peak.
- Check if $start is present in @c_posn.
I am not able to figure out how to do the summation part. I had thought of creating a hash, where all 2nd elements of the 2nd file go into keys and the 3rd elements into values. But the hash comes out unordered. So I created the 2 arrays, namely @c_posn for the 2nd elements and @peak for the 3rd elements. But now I don't know how to compare the 2 arrays simultaneously (to ensure the values of 760970 don't get summed).
use 5.012;
use warnings;
use List::Util qw/first/;
my $file1 = 'chr1trialS.out';
my $file2 = 'b1.wig';
open my $fh1, '<', $file1 or die "Can't open file $file1: $!";
open my $fh2, '<', $file2 or die "Can't open file $file2: $!";
my($start, $end);
while (<$fh1>) {
    my @val1 = split;
    $start = $val1[1];        # Assign start value
    $end = $start + 10;       # Assign end value
    say $start, "->", $end;   # Can be commented out
}
my @c_posn;
my @peak;
while (<$fh2>) {
    my @val2 = split;
    push @c_posn, $val2[1];   # Push all 2nd elements
    push @peak, $val2[2];     # Push all 3rd elements
}
if (first { $_ eq $start} @c_posn) { say "I found it! " } #To check if $start is present in @c_posn
say "@c_posn"; #just to check all 2nd elements are obtained
say "@peak"; #just to check all 3rd elements are obtained
Thank you for taking the time to go through my problem. If any clarifications are needed, please do ask me.
I will be grateful for any and every comment/answer.
This is straightforward to do if b1.wig is small enough to be read into a hash in memory, taking the keys from column 2 and the values from column 3. Then all that must be done is to access each key in each sequence, using zero if a corresponding hash element is non-existent (and so accessing it returns undef).
You haven't said how you want to separate the new totals from the existing data from chr1trialS.out, so I have used spaces. Of course this is easy to change if necessary.
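A minimal sketch of the hash approach described above, not a definitive implementation. The names load_values, group_sums, $SPAN and $GROUP are hypothetical, and the sample data is read from in-memory filehandles so that the sketch runs without the real files. It assumes positions are unique within the second file (it keys on column 2 only and ignores the chromosome column) and that the groups start at $start itself, which is what the desired output implies.

```perl
use 5.012;
use warnings;

# Assumed parameters: the toy data uses a span of 10 summed in groups of 5;
# the real files would use 10_000 and 25 instead.
my ($SPAN, $GROUP) = (10, 5);

# Build a position => value lookup from column 2 and column 3 of each line.
sub load_values {
    my ($fh) = @_;
    my %value_at;
    while (my $line = <$fh>) {
        my (undef, $pos, $val) = split ' ', $line;
        next unless defined $val;    # skip blank or short lines
        $value_at{$pos} = $val;
    }
    return \%value_at;
}

# Sum consecutive groups of $group positions over [$start, $start + $span).
# A position missing from the hash returns undef and is counted as zero.
sub group_sums {
    my ($value_at, $start, $span, $group) = @_;
    my @sums;
    for (my $lo = $start; $lo < $start + $span; $lo += $group) {
        my $sum = 0;
        $sum += $value_at->{$_} // 0 for $lo .. $lo + $group - 1;
        push @sums, $sum;
    }
    return @sums;
}

# Sample data from the question (stand-ins for chr1trialS.out and b1.wig).
my $file1 = <<'END';
chr1 9873 12227 11873 2354 + NR_046018 DDX11L1
chr1 760970 763155 762970 2185 + NR_047520 LOC643837
END

my $file2 = <<'END';
chr1 9871 0
chr1 9872 1
chr1 9873 1
chr1 9874 2
chr1 9875 1
chr1 9876 3
chr1 9877 3
chr1 760970 1
chr1 760971 1
chr1 760972 1
chr1 760973 2
chr1 760974 3
chr1 760975 3
chr1 760976 4
chr1 760977 5
chr1 760978 6
chr1 760979 7
END

open my $fh2, '<', \$file2 or die "Can't open in-memory file2: $!";
my $value_at = load_values($fh2);
close $fh2;

open my $fh1, '<', \$file1 or die "Can't open in-memory file1: $!";
my @report;
while (my $line = <$fh1>) {
    chomp $line;
    my $start = (split ' ', $line)[1];
    next unless defined $start;
    # Append the group totals to the original row, separated by spaces.
    push @report, join ' ', $line, group_sums($value_at, $start, $SPAN, $GROUP);
}
close $fh1;

say for @report;
```

With the sample data this prints the two rows from the desired output, ending in 10 0 and 8 25. Because each position from $start onwards is looked up directly by key, the order in which the hash stores its keys never matters, and the values at 760970 cannot leak into the totals for 9873.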