I am fairly new to programming and trying to resolve this problem. I have

Question

Asked: June 11, 2026 (2026-06-11T15:43:20+00:00)

I am fairly new to programming and trying to resolve this problem. I have a file like this.

CHROM    POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    77  T   C   T   T   T   T           T
tg93    79  C   -   C       C   C   -   -   
tg93    79  C   G   C   C   C   C   G       C
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    105 A   G   A   A   A   A   A   G   A
tg93    108 A   G   A   A   A   A   G   A   A
tg93    114 T   C   T   T   T   T   T   C   T
tg93    131 A   C   A   A   A   A   A   A   A
tg93    136 G   C   C   G   C   C   G   G   G
tg93    150 CTCTC   -       CTCTC       -   CTCTC       CTCTC

The header columns in this file are:

CHROM – name
POS – position
REF – reference
ALT – alternate
10_sample.bam to 16_sample.bam – samples

Now I want to count how many times the letters in the REF and ALT columns occur among the samples. If either of them occurs fewer than two times, I need to delete that row.

For example:
In the first row, I have ‘T’ in REF and ‘C’ in ALT. Across the 7 samples there are 5 T’s, 2 blanks and no C, so I need to delete this row.

In the second row, REF is ‘C’ and ALT is ‘-’. Across the seven samples we have 3 C’s, 2 ‘-’s and 2 blanks, so we keep this row, as C and ‘-’ each occur at least two times.
Blanks are always ignored while counting.

The final file after filtering is

#CHROM   POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    79  C   -   C       C   C   -   -   
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    136 G   C   C   G   C   C   G   G   G

I am able to read the columns into arrays and display them in my code, but I am not sure how to start the loops that read the bases, count their occurrences, and decide which rows to retain. Can anyone tell me how I should proceed? It would also be helpful if you have any example code I can modify.


1 Answer

Editorial Team · Answer 1 · 2026-06-11T15:43:21+00:00

#!/usr/bin/env perl
use strict;
use warnings;

print scalar(<>);                   # Read and output the header.

while (<>) {                        # Read a line.
   chomp;                           # Remove the newline from the line.
   my ($chrom, $pos, $ref, $alt, @samples) =
      split /\t/;                   # Parse the remainder of the line.

   my %counts;                      # Count the occurrences of sample values.
   ++$counts{$_} for @samples;      # e.g. Might end up with $counts{"G"} = 3.

   print "$_\n"                     # Print line if we want to keep it.
      if ($counts{$ref} || 0) >= 2  # ("|| 0" avoids a spurious warning.)
      && ($counts{$alt} || 0) >= 2;
}

Output:

CHROM    POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    79  C   -   C       C   C   -   -   
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    136 G   C   C   G   C   C   G   G   G

You included 108 in your desired output, but it only has one instance of ALT in the seven samples.
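To see why a given row is kept or dropped, the same counting step can be run on single rows. This is only a minimal sketch: `ref_alt_counts` is a hypothetical helper name that is not part of the script above, and the two rows are copied from the question and rebuilt with explicit tabs, on the assumption that the real file is tab-separated.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not in the script above): returns how often the REF
# and ALT values occur among the sample columns of one tab-separated line.
sub ref_alt_counts {
    my ($line) = @_;
    my ($chrom, $pos, $ref, $alt, @samples) = split /\t/, $line;
    my %counts;
    ++$counts{$_} for @samples;
    return ($counts{$ref} || 0, $counts{$alt} || 0);
}

# Rows for POS 80 and POS 108 from the question, joined with tabs.
my @rows = (
    join("\t", qw(tg93 80 G A G G G G A A G)),
    join("\t", qw(tg93 108 A G A A A A G A A)),
);

for my $row (@rows) {
    my ($n_ref, $n_alt) = ref_alt_counts($row);
    my $verdict = ($n_ref >= 2 && $n_alt >= 2) ? "keep" : "drop";
    my $pos = (split /\t/, $row)[1];
    print "POS $pos: REF=$n_ref ALT=$n_alt -> $verdict\n";
}
```

For POS 80 this reports REF=5 and ALT=2 (keep); for POS 108 it reports REF=6 and ALT=1 (drop).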

Usage:

perl script.pl file.in >file.out

Or in-place:

perl -i script.pl file
