I’m stuck in a script I have to write and can’t find a way out…
I have two files with partly overlapping information. Based on the information in one file I have to extract info from the other and save it into multiple new files.
The first is simply a table with IDs and group information (which is used for the splitting).
The other contains the same IDs, but each twice with slightly different information.
What I’m doing:
I create a list of lists with ID and group informazion, like this:
table = [[ID, group], [ID, group], [ID, group], ...]
Then, because the second file is huge and not sorted in the same way as the first, I want to create a dictionary as index. In this index, I would like to save the ID and where it can be found inside the file so I can quickly jump there later. The problem there, of course, is that every ID appears twice. My simple solution (but I’m in doubt about this) is adding an -a or -b to the ID:
index = {"ID-a": [FPos, length], "ID-b": [FPOS, length], "ID-a": [FPos, length], ...}
The code for this:
for line in file:
read = (line.split("\t"))[0]
if not (read+"-a") in indices:
index = read + "-a"
length = len(line)
indices[index] = [FPos, length]
else:
index = read + "-b"
length = len(line)
indices[index] = [FPos, length]
FPos += length
What I am wondering now is if the next step is actually valid (I don’t get errors, but I have some doubts about the output files).
for name in table:
head = name[0]
## first round
(FPos,length) = indices[head+"-a"]
file.seek(FPos)
line = file.read(length)
line = line.rstrip()
items = line.split("\t")
output = ["@" + head +" "+ "1:N:0:" +"\n"+ items[9] +"\n"+ "+" +"\n"+ items[10] +"\n"]
name.append(output)
##second round
(FPos,length) = indices[head+"-b"]
file.seek(FPos)
line = file.read(length)
line = line.rstrip()
items = line.split("\t")
output = ["@" + head +" "+ "2:N:0:" +"\n"+ items[9] +"\n"+ "+" +"\n"+ items[10] +"\n"]
name.append(output)
Is it ok to use a for loop like that?
Is there a better, cleaner way to do this?
Use a
defaultdict(list)to save all your file offsets by ID:The defaultdict will take care of initializing to an empty list on the first occurrence of a given id_value, and then the (fileposn,length) tuple will be appended to it.
This will accumulate all references to each id into the index, whether there are 1, 2, or 20 references. Then you can just search through the given fileposn’s for the related data.