So I have a uniformly formatted text file that I am trying to parse


Asked: June 18, 2026 (2026-06-18T16:15:41+00:00)


So I have a uniformly formatted text file that I am trying to parse based on the number of lines below the word ‘cluster’. Here is my code so far:

f = open('file.txt', 'r')
main_output = open('mainoutput.txt', 'w')
minor_output = open('minoroutput.txt', 'w')
f_lines = f.readlines()
main_list = []
minor_list = []
for n, line in enumerate(open('file.txt')):
    if 'cluster' in line:
        if 'cluster' in f_lines[n+1] or f_lines[n+2] or f_lines[n+3]:
            minor_list.append(line)
            minor_list.append(f_lines[n+1])
            minor_list.append(f_lines[n+2])
            minor_list.append(f_lines[n+3])
        if 'cluster' not in f_lines[n+1] or f_lines[n+2] or f_lines[n+3]:
            main_list.append(line)
            main_list.append(f_lines[n+1])
            main_list.append(f_lines[n+2])
            main_list.append(f_lines[n+3])
minor_output.write(''.join(minor_list))
main_output.write(''.join(main_list))
f.close()
main_output.close()
minor_output.close()

The format of the text file is as follows:

>Cluster 1
line 1
line 2
line 3
...

>Cluster 2
line 1
line 2
...

and so on for many clusters.

Each cluster has a variable number of lines below it, from 1 to 100+. I am interested in sorting these clusters by the number of lines (items) in each cluster. This code runs, but the two output files are identical. Any help with my code or my strategy would be awesome!


Editorial Team · Answer 1 · 2026-06-18T16:15:43+00:00

If I understand the code you’ve posted correctly, you want to sort your data into two different files depending on how many items are in a cluster. If there are three or fewer, the cluster goes into minoroutput.txt, while if there are more than that, it goes into mainoutput.txt.

There are a couple of significant logic errors that I suspect are causing your code to not sort the data properly.

Firstly, your test to see if a line contains the word "cluster" won’t match the capitalized "Cluster" that you have in your example data. This may only be an issue with the example data you’ve shown, and it would be easy to fix by calling lower() on the line before checking it.

Second, your check of later lines is incorrect. The code if 'cluster' in f_lines[n+1] or f_lines[n+2] or f_lines[n+3] doesn’t check for "cluster" in each of the three strings, but rather only in the first. The second and third strings are evaluated all by themselves, in boolean context. Any non-empty string counts as true, and lines returned by readlines() keep their trailing newline, so even a blank line counts as true. That makes the whole expression almost always true. For this to work, you’d need to check 'cluster' in f_lines[n+1] or 'cluster' in f_lines[n+2] or 'cluster' in f_lines[n+3] (but I’ll show a better alternative later). The same problem happens with the other if statement: only f_lines[n+1] is tested with not in, while f_lines[n+2] and f_lines[n+3] are again taken as true on their own. Since both conditions are almost always true, nearly every cluster is appended to both lists, which is why your two output files are identical.
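To make the precedence issue concrete, here is a minimal sketch using made-up sample lines. The list f_lines below is a hypothetical stand-in for the result of f.readlines(), not your real data, and any() is just one possible way to test every line.

```python
# Hypothetical stand-in for f.readlines(); each line keeps its trailing newline.
f_lines = [">cluster 1\n", "line 1\n", "line 2\n", "line 3\n"]
n = 0

# Python parses this as: ('cluster' in f_lines[1]) or f_lines[2] or f_lines[3]
original = 'cluster' in f_lines[n+1] or f_lines[n+2] or f_lines[n+3]

# This tests each of the three following lines for the substring.
intended = any('cluster' in s for s in f_lines[n+1:n+4])

print(repr(original))  # the string 'line 2\n', which counts as true
print(intended)        # False, because none of the three lines mention "cluster"
```

The first expression never produces False for this data: once the membership test fails, or simply hands back the next non-empty string.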

Lastly, your logic for writing out the clusters is probably incorrect. It currently writes out exactly four lines always, even though many clusters will have more or fewer items than that. For every cluster written to mainoutput.txt, some lines will be discarded (this might be deliberate). For some clusters written to minoroutput.txt, however, there’s going to be a clear bug where it will write out the start of the next cluster after a cluster with only one or two items.

Here’s some code that I think will work for you. I’ve changed around the loop so that it just reads the file once, rather than reading the lines once into a list and a second time in enumerate. Rather than explicitly looking at the next three lines, I simply put each line into a list, resetting each time there’s a line with cluster in it (with any capitalization).

with open('file.txt', 'r') as f, \
     open('mainoutput.txt', 'w') as main_out, \
     open('minoroutput.txt', 'w') as minor_out:
    cluster = [] # this variable will hold all the lines of the current cluster
    for line in f:
        if 'cluster' in line.lower(): # if we're at the start of a cluster
            if len(cluster) > 4: # long clusters go in the "main" file
                main_out.writelines(cluster) # write out the lines
                # main_out.writelines(cluster[:4])
            else:
                minor_out.writelines(cluster) # or to the other file

            cluster = [] # reset the cluster variable to a new, empty list

        cluster.append(line) # always add the current line to cluster

    if len(cluster) > 4: # repeat the writing logic for the last cluster
        main_out.writelines(cluster)
        # main_out.writelines(cluster[:4])
    else:
        minor_out.writelines(cluster)

Use the two commented writelines lines in place of the uncommented ones just before them if you only want the first three items in a cluster to be output into mainoutput.txt (with the rest being discarded). I don’t think there’s a reasonable alternative to printing all the lines in minoroutput.txt.
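If repeating the writing logic for the last cluster bothers you, one possible variation is to pull the grouping into a generator. This is only a sketch: the names iter_clusters, split_clusters and max_minor_items are made up for illustration, and it assumes the input starts with a cluster header and has no blank lines between clusters (a blank line would be counted as an item).

```python
def iter_clusters(lines):
    """Yield each cluster as a list of lines, header line first."""
    cluster = []
    for line in lines:
        # A header starts a new cluster; emit the previous one if there is one.
        if 'cluster' in line.lower() and cluster:
            yield cluster
            cluster = []
        cluster.append(line)
    if cluster:  # the last cluster has no header after it to trigger the yield
        yield cluster


def split_clusters(lines, max_minor_items=3):
    """Return (main, minor) lists of lines; the threshold of 3 is an assumption."""
    main, minor = [], []
    for cluster in iter_clusters(lines):
        items = len(cluster) - 1  # do not count the header line itself
        (main if items > max_minor_items else minor).extend(cluster)
    return main, minor


# Made-up sample data standing in for the lines of file.txt.
sample = [">Cluster 1\n", "line 1\n",
          ">Cluster 2\n", "line 1\n", "line 2\n", "line 3\n", "line 4\n"]
main, minor = split_clusters(sample)
print(''.join(main), end='')   # Cluster 2 and its four items
print(''.join(minor), end='')  # Cluster 1 and its single item
```

With a real file you would pass the open file object to split_clusters and hand the two returned lists to writelines() on the output files.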

Given file.txt with these contents:

>Cluster 1
line 1
line 2
line 3
>Cluster 2
line 1
line 2
line 3
line 4
>Cluster 3
line 1
>Cluster 4
line 1
line 2
line 3
line 4
line 5

The code above will output two files:

mainoutput.txt:

>Cluster 2
line 1
line 2
line 3
line 4
>Cluster 4
line 1
line 2
line 3
line 4
line 5

minoroutput.txt:

>Cluster 1
line 1
line 2
line 3
>Cluster 3
line 1
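The question also mentions sorting clusters by the number of items. If you want a full ordering rather than two buckets, something along these lines might work. The function name clusters_by_size is hypothetical, and the sketch assumes the same file layout as the example above (header lines containing "cluster", no blank lines).

```python
def clusters_by_size(lines):
    """Return clusters (lists of lines, header first), largest item count first."""
    clusters = []
    for line in lines:
        # Start a new cluster at each header (or at the very first line).
        if 'cluster' in line.lower() or not clusters:
            clusters.append([])
        clusters[-1].append(line)
    # len(c) - 1 is the number of items, since the header is not an item.
    return sorted(clusters, key=lambda c: len(c) - 1, reverse=True)


# Same contents as the example file.txt above, as a list of lines.
sample = (
    ">Cluster 1\nline 1\nline 2\nline 3\n"
    ">Cluster 2\nline 1\nline 2\nline 3\nline 4\n"
    ">Cluster 3\nline 1\n"
    ">Cluster 4\nline 1\nline 2\nline 3\nline 4\nline 5\n"
).splitlines(keepends=True)

ordered = clusters_by_size(sample)
headers = [c[0].strip() for c in ordered]
print(headers)  # ['>Cluster 4', '>Cluster 2', '>Cluster 1', '>Cluster 3']
```

Because sorted() is stable, clusters with the same number of items keep their original file order.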
