Python beginner, I have become familiar with reading through a file and doing basic

Asked: June 12, 2026

I'm a Python beginner and have become familiar with reading through a file and doing basic operations. Now, however, I want to filter one file based on another: I want to remove from file1 any line whose score in column 3 of file2 is less than 100000.
I have a main data file (file1):

7   303 0.207756232686981
16  23  0.208562019758507
6   57  0.208727272727273
7   80  0.209065354884048
11  124 0.209500609013398

and I want to make a new data file identical to this one, BUT with any line removed whose score is less than 100000 according to a second file (file2):

chr7    303 292526
chr16   23  169805
chr6    57  62822
chr11   124 320564
chr7    80  300291

The first two columns of both files contain the information needed to determine whether a line refers to the same case in both files. However, the second file has ‘chr’ added before each number in the first column (this ‘chr’ can be ignored).
All lines in the first file are present in the second file, but there are some lines in the second file that are not in the first; these can be ignored.

So looking at the example above the line:

6   57  0.208727272727273

would be removed from the new output because its value in the 3rd column of file2 is below 100000, while all other lines in the first file would be kept because they have values over 100000. It is also important that the output file keeps the same line order as file1.

Any help would be greatly appreciated.
I normally use this Python structure:

for line in inputfile:
        line = line.rstrip() 
        fields = line.split("\t")

so an answer building off this structure would be extra great.

Please let me know if the question is unclear.

Solution so far:

#!/usr/bin/env python

# Build a lookup from file2: (column 1 without 'chr', column 2) -> score
f2 = open('/mnt/genotyping/CT/GreatApes/HKA/callability/callable_sites_per_region_500Kb.txt', 'r')
d2 = {}
for line in f2:
    line = line.rstrip()
    fields = line.split("\t")
    key = (fields[0].replace('chr', ''), fields[1])
    d2[key] = int(fields[2])
f2.close()

# Print the lines of file1, in their original order, whose score in file2 is at least 100000
f1 = open('/mnt/genotyping/CT/GreatApes/HKA/Barcelona_approach/500kb/cov_5/Homo-Gorilla/R_plots/Gorilla_genome_dist_cov5.txt', 'r')
for line in f1:
    line = line.rstrip()
    fields = line.split("\t")
    if 'region' not in line:  # skip any line containing 'region'
        key = (fields[0], fields[1])
        if d2[key] >= 100000:
            print(line)
f1.close()

Thanks

1 Answer

Editorial Team · Answer 1 · 2026-06-12T02:49:20+00:00

I used strings instead of files, but the principle remains the same. First, create a dict from file2, keyed by its first two columns (with 'chr' stripped) and holding the score from column 3 as the value:

>>> f2 = """chr7\t303\t292526
chr16\t23\t169805
chr6\t57\t62822
chr11\t124\t320564
chr7\t80\t300291"""
>>> d2 = {}
>>> for line in f2.split('\n'):
    line = line.rstrip()
    fields = line.split("\t")
    key = (fields[0].replace('chr', ''), fields[1])
    d2[key] = int(fields[2])


>>> d2
{('7', '303'): 292526, ('7', '80'): 300291, ('16', '23'): 169805, ('6', '57'): 62822, ('11', '124'): 320564}

Then only print the lines of file1 checking values in d2:

>>> f1 = """7\t303\t0.207756232686981
16\t23\t0.208562019758507
6\t57\t0.208727272727273
7\t80\t0.209065354884048
11\t124\t0.209500609013398"""
>>> for line in f1.split('\n'):
    line = line.rstrip()
    fields = line.split("\t")
    key = (fields[0], fields[1])
    if d2[key] >= 100000:
        print(line)


7   303 0.207756232686981
16  23  0.208562019758507
7   80  0.209065354884048
11  124 0.209500609013398
>>>
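
For real files, the same two steps could be wrapped up roughly as below. This is only a sketch, assuming Python 3, tab-separated input, and no header line in file2. The function names `load_scores` and `filter_by_score`, the temporary file names, and the choice to silently drop lines whose key is missing from file2 are my own assumptions, not something from the question. It writes the result to a new file instead of printing it, and keeps the line order of file1.

```python
import os
import tempfile


def load_scores(path):
    """Map (column 1 without a leading 'chr', column 2) -> integer score."""
    scores = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if not line:
                continue  # ignore blank lines
            fields = line.split("\t")
            chrom = fields[0]
            if chrom.startswith("chr"):
                chrom = chrom[3:]  # strip only a leading 'chr'
            scores[(chrom, fields[1])] = int(fields[2])
    return scores


def filter_by_score(data_path, scores, out_path, threshold=100000):
    """Copy lines from data_path to out_path, in order, if their score >= threshold."""
    kept = 0
    with open(data_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.rstrip()
            if not line:
                continue
            fields = line.split("\t")
            key = (fields[0], fields[1])
            # Assumption: a key missing from file2 counts as score 0 and is dropped.
            if scores.get(key, 0) >= threshold:
                dst.write(line + "\n")
                kept += 1
    return kept


# Demo using the sample data from the question, written to temporary files.
FILE1 = (
    "7\t303\t0.207756232686981\n"
    "16\t23\t0.208562019758507\n"
    "6\t57\t0.208727272727273\n"
    "7\t80\t0.209065354884048\n"
    "11\t124\t0.209500609013398\n"
)
FILE2 = (
    "chr7\t303\t292526\n"
    "chr16\t23\t169805\n"
    "chr6\t57\t62822\n"
    "chr11\t124\t320564\n"
    "chr7\t80\t300291\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path1 = os.path.join(tmp, "file1.txt")
    path2 = os.path.join(tmp, "file2.txt")
    out_path = os.path.join(tmp, "filtered.txt")
    with open(path1, "w") as handle:
        handle.write(FILE1)
    with open(path2, "w") as handle:
        handle.write(FILE2)

    kept = filter_by_score(path1, load_scores(path2), out_path)
    with open(out_path) as handle:
        result = handle.read()

print(result, end="")
```

With the sample data this keeps four lines and drops the `6 57` line. Because a missing key is treated as score 0, a header line in file1 would simply be dropped too; if you would rather get an error for unexpected keys, use `scores[key]` as in the interactive session above.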
