I am trying to convert one of my Perl script to R script. I

Asked: June 11, 2026

I am trying to convert one of my Perl scripts to an R script. I have a data frame in R that looks like this (ignore the column names):

CHR     START          END      TYPE
chr1    945493         945593   normal
chr1    945593        947374    normal
chr1    947374        947474    normal
chr1    947474        947574    gain
chr1    947574        947674    gain
chr1    947674        960364    gain
chr1    960364        960464    normal
chr22   17290491    17290591    normal
chr22   17290591    17290691    normal
chr22   17290691    17290791    gain
chr22   17290791    17292513    gain
chr22   17292513    17292613    gain
chr22   17292613    17292713    gain
chr22   17292713    17293046    gain
chr22   17293346    17298475    gain
chr22   17298475    17298575    gain
chr22   17298575    17298675    normal
chr22   17298675    17303632    normal
chr22   17303632    17303732    loss
chr22   17303732    17303832    normal
chrX    154162621   154181221   normal
chrX    154181221   154181321   normal
chrX    154181321   154181421   loss
chrX    154181421   154181521   loss
chrX    154181521   154181621   loss
chrX    154181621   154181721   loss
chrX    154181721   154216867   loss
chrX    154216867   154216967   normal
chrX    154216967   154217067   normal
chrX    154217067   154217167   normal

If at least 5 consecutive rows have the same value in both the “CHR” column and the “TYPE” column, I want to combine them into one row, where START takes the value from the first row and END takes the value from the last row. In the end, only rows whose TYPE is “gain” or “loss” should be returned. So the desired output is:

chr22   17290691        17298575        gain
chrX    154181321       154216867       loss

What I am doing right now is:

Save the data frame with “write.table”.

Run this Perl script on the saved file:

open $first, "<", $ARGV[0] or die "Unable to open input file: $!";
my $count = 1;
$_ = <$first>;
chomp;
my ($p_key, $p_col1, $p_col2, $p_cnv) = split;

while (<$first>) {
    chomp;
    my ($key, $col1, $col2, $cnv) = split;
    if ($key eq $p_key and $cnv eq $p_cnv) {
        $p_col2 = $col2;
        $count++;
    } elsif ($count > 4) {
        print $p_key, "\t", $p_col1, "\t", $p_col2, "\t", $p_cnv, "\n" if ($p_cnv eq "gain" or $p_cnv eq "loss");
        ($p_key, $p_col1, $p_col2, $p_cnv) = ($key, $col1, $col2, $cnv);
        $count = 1;
    } else {
        ($p_key, $p_col1, $p_col2, $p_cnv) = ($key, $col1, $col2, $cnv);
        $count = 1;
    }
}

Saving the data frame first and then running a Perl script seems like an unnecessary extra step. Could anyone please suggest an easier way to do this in R, such as a package or some other trick?

1 Answer

Editorial Team · Answer 1 · 2026-06-11T22:19:16+00:00

I assumed that you would want to interrupt the sequences (i.e. treat them as distinct) if a different TYPE value intervenes within one chromosome. You didn’t state this explicitly, but I think the biology warrants that additional requirement, hence the need to create another variable. In the absence of advice to the contrary, I will assume the data frame is named cdat. The code below looks within consecutive runs of TYPE, applies the test, and binds the CHR and START of the first row to the END and TYPE of the last row.

# Run id: increments every time TYPE differs from the previous row
cdat$conseq <- cumsum(c(1, cdat$TYPE[-1] != cdat$TYPE[-length(cdat$TYPE)]))

# For each CHR/run group, keep only runs of 5+ rows typed "gain" or "loss"
do.call(rbind,
    by(cdat, list(cdat$CHR, cdat$conseq),
        function(df) {
            if (NROW(df) >= 5 && df$TYPE[1] %in% c("gain", "loss")) {
                cbind(df[1, c("CHR", "START")], df[NROW(df), c("END", "TYPE")])
            } else {
                NULL
            }
        }))
#      CHR     START       END TYPE
# 10 chr22  17290691  17298575 gain
# 23  chrX 154181321 154216867 loss

The conseq vector is built by comparing each TYPE value to the one before it and taking the cumsum() of the changes along its full length. Since the two shifted vectors being compared are one element shorter than the column, the 1 is added at the beginning as a placeholder so that the result lines up with the data frame.
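
To make the run-id idea concrete, here is a minimal sketch on a toy vector. The values in type are made up for illustration and are not the data from the question; the names changed and run_lengths are likewise only for this example.

```r
# Toy TYPE column (hypothetical values, not the asker's data)
type <- c("normal", "normal", "gain", "gain", "gain", "normal", "loss", "loss")

# TRUE wherever an element differs from the one before it;
# this vector is one element shorter than type
changed <- type[-1] != type[-length(type)]

# Prepend 1 so there is one entry per row, then accumulate:
# each TRUE bumps the run id by one
conseq <- cumsum(c(1, changed))
print(conseq)
# [1] 1 1 2 2 2 3 4 4

# Number of rows in each run; rle(type)$lengths gives the same counts
run_lengths <- as.vector(table(conseq))
print(run_lengths)
# [1] 2 3 1 2
```

Rows that share a run id belong to the same uninterrupted stretch of one TYPE, which is why grouping by CHR and conseq together isolates each candidate segment.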
