I'm a Python beginner and have become familiar with reading through a file and doing basic operations. Now I want to filter one file based on another: I want to remove from file1 any line whose matching line in file2 has a score of less than 100000 in column 3.
I have a main data file (file1):
7 303 0.207756232686981
16 23 0.208562019758507
6 57 0.208727272727273
7 80 0.209065354884048
11 124 0.209500609013398
I want to make a new data file identical to this one, but without any lines that have a score of less than 100000 according to a second file (file2):
chr7 303 292526
chr16 23 169805
chr6 57 62822
chr11 124 320564
chr7 80 300291
The first two columns of both files together identify whether a line refers to the same case in both files. However, the second file has 'chr' added before each number in the first column (this 'chr' can be ignored).
All lines in the first file are present in the second file but there are some lines in the second file not in the first that can be ignored.
So looking at the example above the line:
6 57 0.208727272727273
would be removed from the new output because its value in the 3rd column of file2 is below 100000, while all other lines in the first file would be included as they have values over 100000. It is also important that the output file keeps the same line order as file1.
Any help would be greatly appreciated.
I normally use the python structure of
for line in inputfile:
    line = line.rstrip()
    fields = line.split("\t")
so an answer building off this structure would be extra great.
Please let me know if the question is unclear.
Solution so far:
#!/usr/bin/env python
# Build a lookup from file2: (chromosome without 'chr', position) -> score
f2 = open( '/mnt/genotyping/CT/GreatApes/HKA/callability/callable_sites_per_region_500Kb.txt', 'r')
d2 = {}
print(f2)  # debug output: shows the file object
for line in f2:
    line = line.rstrip()
    fields = line.split("\t")
    key = (fields[0].replace('chr', ''), fields[1])
    d2[key] = int(fields[2])
f2.close()
# Print the lines of file1 whose score in file2 is at least 100000
f1 = open( '/mnt/genotyping/CT/GreatApes/HKA/Barcelona_approach/500kb/cov_5/Homo-Gorilla/R_plots/Gorilla_genome_dist_cov5.txt', 'r')
for line in f1:
    line = line.rstrip()
    fields = line.split("\t")
    if 'region' not in line:  # skip any line containing 'region' (presumably the header)
        key = (fields[0], fields[1])
        if d2[key] >= 100000:
            print(line)
f1.close()
Thanks
I used strings instead of files, but the principle remains the same. First, create a dict keyed on the first two columns of file2:
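A minimal sketch of this first step, using the file2 sample from the question. The names `file2_text` and `build_lookup` are made up for illustration, and the columns are assumed to be whitespace-separated with no header line:

```python
# Sample contents of file2 from the question, held in a string (illustrative name).
file2_text = """chr7 303 292526
chr16 23 169805
chr6 57 62822
chr11 124 320564
chr7 80 300291"""


def build_lookup(lines):
    """Map (chromosome without 'chr', position) -> score from column 3."""
    d2 = {}
    for line in lines:
        # split() handles any whitespace; use split("\t") for strictly tab-separated data
        fields = line.rstrip().split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        key = (fields[0].replace("chr", ""), fields[1])
        d2[key] = int(fields[2])
    return d2


d2 = build_lookup(file2_text.splitlines())
```

With real files, the same function should work if it is passed an open file object instead of `file2_text.splitlines()`, since both yield one line at a time.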
Then only print the lines of file1 after checking their values in d2:
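The filtering step might look like the sketch below, which repeats the dict construction so that it runs on its own. The sample strings come from the question; `file1_text`, `file2_text`, `THRESHOLD` and `kept` are illustrative names, whitespace-separated columns are assumed, and dropping lines that have no match in file2 (via `d2.get(key, 0)`) is an assumption rather than something the question requires:

```python
# Sample data from the question, held in strings (illustrative names).
file1_text = """7 303 0.207756232686981
16 23 0.208562019758507
6 57 0.208727272727273
7 80 0.209065354884048
11 124 0.209500609013398"""

file2_text = """chr7 303 292526
chr16 23 169805
chr6 57 62822
chr11 124 320564
chr7 80 300291"""

THRESHOLD = 100000

# Step 1: lookup keyed on (chromosome without 'chr', position).
d2 = {}
for line in file2_text.splitlines():
    fields = line.rstrip().split()
    d2[(fields[0].replace("chr", ""), fields[1])] = int(fields[2])

# Step 2: walk file1 in its original order and keep lines that pass the threshold.
kept = []
for line in file1_text.splitlines():
    line = line.rstrip()
    fields = line.split()
    key = (fields[0], fields[1])
    # Assumption: a key missing from file2 counts as score 0 and is dropped.
    if d2.get(key, 0) >= THRESHOLD:
        kept.append(line)

print("\n".join(kept))
```

Because file1 is read top to bottom and the dict is only used for lookups, the output keeps the line order of file1.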