I am trying to convert one of my Perl scripts to an R script. I have a dataframe in R which looks like this (ignore the column names):
CHR START END TYPE
chr1 945493 945593 normal
chr1 945593 947374 normal
chr1 947374 947474 normal
chr1 947474 947574 gain
chr1 947574 947674 gain
chr1 947674 960364 gain
chr1 960364 960464 normal
chr22 17290491 17290591 normal
chr22 17290591 17290691 normal
chr22 17290691 17290791 gain
chr22 17290791 17292513 gain
chr22 17292513 17292613 gain
chr22 17292613 17292713 gain
chr22 17292713 17293046 gain
chr22 17293346 17298475 gain
chr22 17298475 17298575 gain
chr22 17298575 17298675 normal
chr22 17298675 17303632 normal
chr22 17303632 17303732 loss
chr22 17303732 17303832 normal
chrX 154162621 154181221 normal
chrX 154181221 154181321 normal
chrX 154181321 154181421 loss
chrX 154181421 154181521 loss
chrX 154181521 154181621 loss
chrX 154181621 154181721 loss
chrX 154181721 154216867 loss
chrX 154216867 154216967 normal
chrX 154216967 154217067 normal
chrX 154217067 154217167 normal
If at least 5 consecutive rows have the same value in both the “CHR” column and the “TYPE” column, I want to combine those rows into one row, so that START holds the value from the first row and END holds the value from the last row. In the end, only rows whose TYPE is “gain” or “loss” should be returned. So the desired output is:
chr22 17290691 17298575 gain
chrX 154181321 154216867 loss
What I am doing right now is:
- Saving the dataframe with “write.table”.
- Running this Perl script on the saved file:
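open my $first, "<", $ARGV[0] or die "Unable to open input file: $!";
my $count = 1;
$_ = <$first>;
chomp;
my ($p_key, $p_col1, $p_col2, $p_cnv) = split;
while (<$first>) {
    chomp;
    my ($key, $col1, $col2, $cnv) = split;
    if ($key eq $p_key and $cnv eq $p_cnv) {
        # Same chromosome and type: extend the current run.
        $p_col2 = $col2;
        $count++;
    } elsif ($count > 4) {
        # A run of at least 5 rows has ended: print it if it is a gain or loss.
        print $p_key, "\t", $p_col1, "\t", $p_col2, "\t", $p_cnv, "\n"
            if ($p_cnv eq "gain" or $p_cnv eq "loss");
        ($p_key, $p_col1, $p_col2, $p_cnv) = ($key, $col1, $col2, $cnv);
        $count = 1;
    } else {
        # A short run has ended: discard it and start a new one.
        ($p_key, $p_col1, $p_col2, $p_cnv) = ($key, $col1, $col2, $cnv);
        $count = 1;
    }
}
# Flush the final run, which the loop above never prints.
print $p_key, "\t", $p_col1, "\t", $p_col2, "\t", $p_cnv, "\n"
    if ($count > 4 and ($p_cnv eq "gain" or $p_cnv eq "loss"));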
Saving the dataframe first and then running a Perl script seems like an unnecessary extra step. Could anyone please suggest an easier way to do this in R, with any package or any other trick?
I was concerned that you would want to interrupt the sequences (i.e. treat them as distinct) if there were intervening alternate values for TYPE within one chromosome. You didn’t state this explicitly, but I think the biology would warrant that additional requirement, hence the need to create another variable. We will assume the dataframe is named cdat, in the absence of advice to the contrary.
The approach looks within consecutive runs of TYPE, applies the test, and binds the CHR and START from the first row of a run to the END and TYPE from its last row. The conseq vector is built by comparing each TYPE value to the one before it and cumsum()-ing the appearances of a new value along the full length. Since that comparison is one element shorter than the column, a 1 is added as a placeholder at the beginning so that it lines up with the dataframe.
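The description above can be sketched in base R roughly as follows. This is a minimal illustration rather than a definitive implementation: it assumes the dataframe is called cdat with columns CHR, START, END and TYPE as in the question, and the helper names n, runs and merged are made up for this example.

```r
# Example data copied from the question; `cdat` is the assumed dataframe name.
cdat <- read.table(text = "CHR START END TYPE
chr1 945493 945593 normal
chr1 945593 947374 normal
chr1 947374 947474 normal
chr1 947474 947574 gain
chr1 947574 947674 gain
chr1 947674 960364 gain
chr1 960364 960464 normal
chr22 17290491 17290591 normal
chr22 17290591 17290691 normal
chr22 17290691 17290791 gain
chr22 17290791 17292513 gain
chr22 17292513 17292613 gain
chr22 17292613 17292713 gain
chr22 17292713 17293046 gain
chr22 17293346 17298475 gain
chr22 17298475 17298575 gain
chr22 17298575 17298675 normal
chr22 17298675 17303632 normal
chr22 17303632 17303732 loss
chr22 17303732 17303832 normal
chrX 154162621 154181221 normal
chrX 154181221 154181321 normal
chrX 154181321 154181421 loss
chrX 154181421 154181521 loss
chrX 154181521 154181621 loss
chrX 154181621 154181721 loss
chrX 154181721 154216867 loss
chrX 154216867 154216967 normal
chrX 154216967 154217067 normal
chrX 154217067 154217167 normal", header = TRUE, stringsAsFactors = FALSE)

# Run identifier: goes up by 1 whenever TYPE differs from the previous row.
# The comparison is one element shorter than the column, so a leading 1
# pads it to the full length of the dataframe.
n <- nrow(cdat)
cdat$conseq <- cumsum(c(1, cdat$TYPE[-1] != cdat$TYPE[-n]))

# One piece per (chromosome, run) combination; drop = TRUE skips empty ones.
runs <- split(cdat, list(cdat$CHR, cdat$conseq), drop = TRUE)

# Keep runs of at least 5 gain/loss rows and collapse each into a single row:
# CHR and START from the first row, END and TYPE from the last row.
merged <- lapply(runs, function(df) {
  if (nrow(df) >= 5 && df$TYPE[1] %in% c("gain", "loss")) {
    cbind(df[1, c("CHR", "START")], df[nrow(df), c("END", "TYPE")])
  } else {
    NULL
  }
})
merged <- do.call(rbind, Filter(Negate(is.null), merged))
if (!is.null(merged)) rownames(merged) <- NULL
print(merged)
```

On the example data this should give the two rows from the desired output (the chr22 gain and the chrX loss). Splitting on both CHR and conseq means a run never continues across a chromosome boundary, even when the TYPE value happens to stay the same.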