I’m stuck in a script I have to write and can’t find a way

Question

Asked: June 16, 2026 (2026-06-16T20:23:25+00:00)

I’m stuck in a script I have to write and can’t find a way out…

I have two files with partly overlapping information. Based on the information in one file I have to extract info from the other and save it into multiple new files.
The first is simply a table with IDs and group information (which is used for the splitting).
The other contains the same IDs, but each ID appears twice, with slightly different information.

What I’m doing:
I create a list of lists with ID and group information, like this:

table = [[ID, group], [ID, group], [ID, group], ...]

Then, because the second file is huge and not sorted in the same way as the first, I want to create a dictionary to use as an index. In this index, I would like to save each ID and where it can be found inside the file, so I can quickly jump there later. The problem, of course, is that every ID appears twice. My simple solution (though I have doubts about it) is to add -a or -b to the ID:

indices = {"ID-a": [FPos, length], "ID-b": [FPos, length], "ID-a": [FPos, length], ...}

The code for this:

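indices = {}
FPos = 0  # running offset of the current line within the file

for line in file:
    read = line.split("\t")[0]  # the ID is the first tab-separated field
    length = len(line)
    if (read + "-a") not in indices:
        index = read + "-a"  # first occurrence of this ID
    else:
        index = read + "-b"  # second occurrence of this ID
    indices[index] = [FPos, length]
    FPos += length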

What I am wondering now is if the next step is actually valid (I don’t get errors, but I have some doubts about the output files).

for name in table:
    head = name[0]
    ## first round: the "-a" line of this ID
    (FPos, length) = indices[head + "-a"]
    file.seek(FPos)
    line = file.read(length)
    line = line.rstrip()
    items = line.split("\t")
    output = ["@" + head + " " + "1:N:0:" + "\n" + items[9] + "\n" + "+" + "\n" + items[10] + "\n"]
    name.append(output)
    ## second round: the "-b" line of this ID
    (FPos, length) = indices[head + "-b"]
    file.seek(FPos)
    line = file.read(length)
    line = line.rstrip()
    items = line.split("\t")
    output = ["@" + head + " " + "2:N:0:" + "\n" + items[9] + "\n" + "+" + "\n" + items[10] + "\n"]
    name.append(output)

Is it ok to use a for loop like that?

Is there a better, cleaner way to do this?


1 Answer

Editorial Team · Answer 1 · 2026-06-16T20:23:27+00:00

Use a defaultdict(list) to save all your file offsets by ID:

from collections import defaultdict

index = defaultdict(list)

for line in file:
    # ...code that loops through file finding ID lines...
    index[id_value].append((fileposn,length))

The defaultdict will take care of initializing to an empty list on the first occurrence of a given id_value, and then the (fileposn,length) tuple will be appended to it.
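To make that behaviour concrete, here is a minimal sketch. The IDs and the (offset, length) pairs are made up for illustration and do not come from the asker's file.

```python
from collections import defaultdict

index = defaultdict(list)

# Hypothetical (offset, length) pairs, as if seen while scanning a file.
index["read1"].append((0, 40))    # first sighting: an empty list is created, then appended to
index["read2"].append((40, 38))
index["read1"].append((78, 40))   # second sighting: appended to the existing list

print(dict(index))
# {'read1': [(0, 40), (78, 40)], 'read2': [(40, 38)]}
```

One thing to keep in mind: merely looking up a missing key with index[key] inserts an empty list for it, so use `key in index` or index.get(key) when you only want to test for presence.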

This will accumulate all references to each ID in the index, whether there are 1, 2, or 20 of them. You can then seek to each stored fileposn in turn to read the related data.
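Putting the two steps together might look roughly like the sketch below. It assumes, as in the question, that the ID is the first tab-separated field of each line. The function names build_index and fetch_records are invented for this example. The file is opened in binary mode on purpose: in text mode len(line) counts characters after decoding and newline translation, which can differ from the byte offsets that seek() works with.

```python
from collections import defaultdict


def build_index(path):
    """Map each ID (first tab-separated field) to a list of (offset, length)."""
    index = defaultdict(list)
    offset = 0
    # Binary mode: len(line) is a byte count, matching what seek() expects.
    with open(path, "rb") as handle:
        for line in handle:
            read_id = line.split(b"\t", 1)[0].decode()
            index[read_id].append((offset, len(line)))
            offset += len(line)
    return index


def fetch_records(handle, index, read_id):
    """Return the split fields of every line stored for read_id, in file order."""
    records = []
    # .get() avoids creating an empty entry for IDs that are not in the file.
    for offset, length in index.get(read_id, []):
        handle.seek(offset)
        line = handle.read(length).decode().rstrip("\r\n")
        records.append(line.split("\t"))
    return records
```

With this in place, the "-a" and "-b" suffixes are no longer needed: the first tuple in each list plays the role of "-a" and the second of "-b", and the loop over the table only has to call fetch_records(handle, index, head) on a handle opened with "rb".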
