I am working on some genome data and I have 2 files -> File1

Question

0

Editorial Team

Asked: May 30, 20262026-05-30T16:24:46+00:00 2026-05-30T16:24:46+00:00

I am working on some genome data and I have 2 files -> File1

0

I am working on some genome data and I have 2 files ->

File1

File2

>A1
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT
>A2
GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCA
AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC
CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT

In file 1, 2nd and 3rd column represents the indexes in File2. So I want that, if character in column1 of file1 matches with character followed by symbol (>) in file2 , then from next line of that file2 give back the substring according to indexes in col2 and col3 of file1. (sorry, I know its complicated) Here is the desire output ->

Output

>A1#1:10
CTATTATTTA
>A1#15:20
ACCTA
>A2#2:11
TCTGCACAGC
>A2#13:16
GCTT

I know if I have only 1 string I can take out sub-string very easily ->

@ARGV or die "No input file specified";
open $first, '<',$ARGV[0] or die "Unable to open input file: $!";
$string="GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT";
while (<$first>) 
{
@cols = split /\s+/;
$co=$cols[1]-1;
$length=$cols[2]-$co;
$fragment =  substr $string, $co, $length;
print ">",$cols[0],"#",$cols[1],":",$cols[2],"\n",$fragment,"\n";
}

but here my problem is when should I input my second file and how should I match the character in col1 (of file1) with character in file2 (followed by > symbol) and then how to get substring?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-30T16:24:48+00:00

I wasnt sure if they were all one continuous line or separate lines.
I set it up as continuous for now.

Basically, read the 2nd file as master.
Then you can process as many index files as you need.

You can use hash of arrays to help with the indexing.
push @{$index{$key}}, [$start,$stop];

use strict;
my $master_file = "dna_master.txt";
if ($#ARGV) {
    print "Usage: $0 [filename(s)]\n";
    exit 1;
}

my %Data = read_master($master_file);

foreach my $index_file (@ARGV) {
    my %Index = read_index($index_file);
    foreach my $key (sort keys %Index) {
        foreach my $i (@{$Index{$key}}) {
            my ($start,$stop) = @$i;
            print ">$key#$start:$stop\n";
            my $pos = $start - 1;
            my $count = $stop - $start + 1;
            print substr($Data{$key},$pos,$count)."\n";
        }
    }
}

sub read_file {
    my $file = shift;
    my @lines;
    open(FILE, $file) or die "Error: cannot open $file\n$!";
    while(<FILE>){
        chomp; #remove newline
        s/(^\s+|\s+$)//g; # strip lead/trail whitespace
        next if /^$/;  # skip blanks
        push @lines, $_;
    }
    close FILE;
    return @lines;
}

sub read_index {
    my $file = shift;
    my @lines = read_file($file);
    my %index;
    foreach (@lines) {
        my ($key,$start,$stop) = split /\s+/;
        push @{$index{$key}}, [$start,$stop]; 
    }
    return %index;
}

sub read_master {
    my $file = shift;
    my %master;
    my $key;
    my @lines = read_file($file);
    foreach (@lines) {
        if ( m{^>(\w+)} ) { $key = $1 }
        else { $master{$key} .= $_ }
    }
    return %master;
}

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am working on some genome data and I have 2 files -> File1

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply