I am stuck with a rather unique problem. I have 2 files which I

Asked: June 16, 2026 (2026-06-16T20:21:53+00:00)

I am stuck with a rather unique problem. I have 2 files which I am reading. A small version of those 2 files looks like the following:

File1

chr1    9873    12227   11873   2354    +   NR_046018   DDX11L1
chr1    760970  763155  762970  2185    +   NR_047520   LOC643837

File2

chr1    9871    0   
chr1    9872    1
chr1    9873    1
chr1    9874    2
chr1    9875    1
chr1    9876    3
chr1    9877    3
chr1    760970  1
chr1    760971  1
chr1    760972  1
chr1    760973  2
chr1    760974  3
chr1    760975  3
chr1    760976  4
chr1    760977  5
chr1    760978  6
chr1    760979  7
chr1    760980  6
chr1    760981  7
chr1    760982  8
chr1    760983  9
chr1    760984  10
chr1    760985  11
chr1    760986  12
chr1    760987  10
chr1    760988  9
chr1    760989  6

Problem

From the 1st file, I have to pick up the 2nd element of each row and take it as $start. An ending position is determined by $end = $start + 10.
Based on $start, I then have to take the 2nd file and look at the 2nd element of each row. Once $start is found, I need to sum the corresponding values of the 3rd element in groups of 5, up to $end.

So as $end is $start + 10 and I am summing in groups of 5, 2 summation values would be obtained.

If some positions up to $end are not present in the 2nd element of the 2nd file, the code should not stop; it should continue the summation and display the sum as 0 (when a continuous group of 5 elements is not present).

Taking the example of the files here, from File1, the 2nd element is 9873, which is assigned to $start. Thus $end would be $start + 10, i.e. 9883.

From File2, once $start is found in the 2nd element of a row, the 3rd elements of the next 5 rows have to be summed as the 1st group, and the next 5 values summed as the 2nd group, until $end.

Note

As can be seen in File2, $end, i.e. 9883, is not present. Hence the sum of the values from 9879 to 9883 must be zero. It must not sum the values from 760970 onwards…
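For illustration, here is a minimal sketch of how the two groups fall for the first row. It assumes each group starts at $start itself, which is what the desired sum of 10 for the first group implies; under that assumption the second group covers 9878 to 9882. The names %value, $span and $group are made up for this example.

```perl
use strict;
use warnings;

# Sample values from File2 around the first start position (position => value).
my %value = (9871 => 0, 9872 => 1, 9873 => 1, 9874 => 2,
             9875 => 1, 9876 => 3, 9877 => 3);

# Assumption: groups begin at $start itself and are 5 positions wide.
my ($start, $span, $group) = (9873, 10, 5);

my @lines;
for (my $from = $start; $from < $start + $span; $from += $group) {
    my $to  = $from + $group - 1;
    my $sum = 0;
    $sum += $value{$_} // 0 for $from .. $to;   # a missing position counts as 0
    push @lines, "$from..$to => $sum";
}
print "$_\n" for @lines;
# 9873..9877 => 10
# 9878..9882 => 0
```

Because the lookup is by position, rows from 760970 onwards can never leak into the sums for 9873.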

Desired Output

chr1    9873    12227   11873   2354    +   NR_046018   DDX11L1      10   0
chr1    760970  763155  762970  2185    +   NR_047520   LOC643837    8   25

Points to Note

While dealing with the actual files, $end = $start + 10,000 (instead of $end = $start + 10).
Also, on the same note, groups of 25 values will be summed (instead of 5), giving a total of 400 values when working with the actual files.
If a range of values is not present in the 2nd element of $file2, summation should proceed as normal; if a continuous group of 25 values is absent, 0 should be printed.
The files contain > 1 million rows each.

Code

The code I’ve written so far manages to do the following:

Read from the files.
Assign $start and $end from file1.
From file2, push all 2nd elements into array @c_posn and all 3rd elements into array @peak.
Check if $start is present in @c_posn.

I am not able to figure out how to do the summation part. I had thought of creating a hash where all 2nd elements of the 2nd file go into the keys and the 3rd elements into the values, but the hash comes out unordered. So I created 2 arrays instead, namely @c_posn for the 2nd elements and @peak for the 3rd elements. But now I don’t know how to compare the 2 arrays simultaneously (to ensure the values from 760970 onwards don’t get summed).

use 5.012;
use warnings;
use List::Util qw/first/;

my $file1 = 'chr1trialS.out';
my $file2 = 'b1.wig';

open my $fh1,'<',$file1 or die "Can't open file $file1: $!";
open my $fh2,'<',$file2 or die "Can't open file $file2: $!";

my($start, $end);
while(<$fh1>){
    my @val1 = split;
    $start = $val1[1]; #Assign start value
    $end = $start + 10; #Assign end value
    say $start,"->",$end; #Can be commented out
}

my @c_posn;
my @peak;

while(<$fh2>){
    my @val2 = split;   
    push @c_posn,$val2[1]; #Push all 2nd elements 
    push @peak, $val2[2];  #Push all 3rd elements        
}           

if (first { $_ eq $start} @c_posn) { say "I found it! " } #To check if $start is present in @c_posn

say "@c_posn"; #just to check all 2nd elements are obtained
say "@peak"; #just to check all 3rd elements are obtained

Thank you for taking the time to go through my problem. If any clarifications are needed, please do ask me.
I will be grateful for any and every comment/answer.

1 Answer

Editorial Team · Answer 1 · 2026-06-16T20:21:54+00:00

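This is straightforward to do if b1.wig is small enough to be read into a hash in memory, taking the keys from column 2 and the values from column 3. Then all that must be done is to look up each position in each range in turn, using zero if the corresponding hash element does not exist (so that accessing it returns undef). It does not matter that the hash is unordered, because the positions are generated by counting up from the start value rather than by iterating over the hash keys.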

You haven’t said how you want to separate the new totals from the existing data from chr1trialS.out so I have used spaces. Of course this is easy to change if necessary.

use strict;
use warnings;

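use constant SAMPLE_SIZE => 10;  # positions covered per row; 10_000 for the real files
use constant CHUNK_SIZE  => 5;   # positions per total; 25 for the real files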

my $file1 = 'chr1trialS.out';
my $file2 = 'b1.wig';

# Map each position (column 2 of b1.wig) to its value (column 3).
my %data2;
{
  open my $fh, '<', $file2 or die "Can't open $file2: $!";

  while (<$fh>) {
    my ($key, $val) = (split)[1,2];
    $data2{$key} = $val;
  }
}

open my $fh, '<', $file1 or die "Can't open $file1: $!";

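while (<$fh>) {
  chomp;
  my $key = (split)[1];    # start position: column 2 of the current row
  my @totals;
  my $n = 0;
  while ($n < SAMPLE_SIZE) {
    push @totals, 0 if $n++ % CHUNK_SIZE == 0;    # start a new total at each chunk boundary
    $totals[-1] += $data2{$key++} // 0;           # a position missing from b1.wig counts as 0
  }
  print "$_ @totals\n";
}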

output

chr1    9873    12227   11873   2354    +   NR_046018   DDX11L1 10 0
chr1    760970  763155  762970  2185    +   NR_047520   LOC643837 8 25
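
As a rough sketch of how the same loop could be reused with the sizes described for the real files (10,000 positions in groups of 25, giving 400 totals per row), the summation can be moved into a subroutine that takes both sizes as arguments. The name chunk_sums and its parameter order are made up for this example, and the hash below holds only the sample rows near 760970.

```perl
use strict;
use warnings;

# Hypothetical helper: sum $data->{$start}, $data->{$start + 1}, ... in
# consecutive chunks; positions missing from the hash contribute 0.
sub chunk_sums {
    my ($data, $start, $sample_size, $chunk_size) = @_;
    my @totals;
    for my $offset (0 .. $sample_size - 1) {
        push @totals, 0 if $offset % $chunk_size == 0;   # open a new chunk
        $totals[-1] += $data->{ $start + $offset } // 0;
    }
    return @totals;
}

# Sample rows from File2 near position 760970 (position => value).
my %data2;
@data2{760970 .. 760979} = (1, 1, 1, 2, 3, 3, 4, 5, 6, 7);

my @small = chunk_sums(\%data2, 760970, 10, 5);
print "@small\n";             # 8 25

# Sizes described for the real files: 10,000 positions in chunks of 25.
my @real = chunk_sums(\%data2, 760970, 10_000, 25);
print scalar(@real), "\n";    # 400
```

With more than a million rows in each file, this still relies on the whole of b1.wig fitting in memory as a hash, as noted at the start of the answer.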
