So I have a uniformly formatted text file that I am trying to parse based on the number of lines below the word ‘cluster’. Here is my code so far:
f = open('file.txt', 'r')
main_output = open('mainoutput.txt', 'w')
minor_output = open('minoroutput.txt', 'w')
f_lines = f.readlines()
main_list = []
minor_list = []
for n, line in enumerate(open('file.txt')):
    if 'cluster' in line:
        if 'cluster' in f_lines[n+1] or f_lines[n+2] or f_lines[n+3]:
            minor_list.append(line)
            minor_list.append(f_lines[n+1])
            minor_list.append(f_lines[n+2])
            minor_list.append(f_lines[n+3])
        if 'cluster' not in f_lines[n+1] or f_lines[n+2] or f_lines[n+3]:
            main_list.append(line)
            main_list.append(f_lines[n+1])
            main_list.append(f_lines[n+2])
            main_list.append(f_lines[n+3])
minor_output.write(''.join(minor_list))
main_output.write(''.join(main_list))
f.close()
main_output.close()
minor_output.close()
The format of the text file is as follows:
>Cluster 1
line 1
line 2
line 3
...
>Cluster 2
line 1
line 2
...
and so on for many clusters.
Each cluster has a variable number of lines below it, from 1 to 100+. I am interested in sorting these clusters by the number of lines (items) in each cluster. The code runs, but the two output files are identical. Any help with my code or my strategy would be awesome!
If I understand the code you’ve posted correctly, you want to sort your data into two different files depending on how many items are in a cluster. If there are three or fewer, the cluster goes into minoroutput.txt, while if there are more than that, it goes into mainoutput.txt.
There are a couple of significant logic errors that I suspect are causing your code to not sort the data properly.
Firstly, your test to see if a line contains the word "cluster" won’t match the capitalized "Cluster" in your example data. This may only be an issue with the example data you’ve shown, and it would be easy to fix by calling lower() on the line before checking it.
Second, your check of later lines is incorrect. The condition if 'cluster' in f_lines[n+1] or f_lines[n+2] or f_lines[n+3] doesn’t check for "cluster" in each of the three strings, but only in the first. The second and third strings are evaluated by themselves, in a boolean context. Any non-empty string is truthy, so the whole expression is almost always true as well. For this to work, you’d need to check 'cluster' in f_lines[n+1] or 'cluster' in f_lines[n+2] or 'cluster' in f_lines[n+3] (but I’ll show a better alternative later). The same problem happens with the other if statement, where you will also almost always get a true result from your condition, since f_lines[n+2] and f_lines[n+3] are probably not both empty.
Lastly, your logic for writing out the clusters is probably incorrect. It currently always writes out exactly four lines, even though many clusters will have more or fewer items than that. For every cluster written to mainoutput.txt, some lines will be discarded (this might be deliberate). For some clusters written to minoroutput.txt, however, there’s going to be a clear bug: after a cluster with only one or two items, it will write out the start of the next cluster.
Here’s some code that I think will work for you. I’ve changed the loop so that it reads the file just once, rather than reading the lines once into a list and a second time in enumerate. Rather than explicitly looking at the next three lines, I simply put each line into a list, resetting it each time there’s a line with cluster in it (with any capitalization).
Use the two commented writelines lines in place of the uncommented ones just before them if you only want the first three items in a cluster to be output into mainoutput.txt (with the rest being discarded). I don’t think there’s a reasonable alternative to printing all the lines in minoroutput.txt.
Given file.txt with these contents:
The code above will output two files:
mainoutput.txt:
minoroutput.txt:
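The single-pass approach described above can be sketched roughly as follows. This is a minimal sketch under assumptions, not necessarily the exact listing: the function name split_clusters, its parameters, and the max_minor_items threshold of three are hypothetical names chosen for illustration, and the file is assumed to start with a cluster header line.

```python
def split_clusters(in_path, main_path, minor_path, max_minor_items=3):
    """Copy each cluster to main_path or minor_path depending on its item count.

    Hypothetical helper: clusters with more than max_minor_items items go to
    main_path, all others go to minor_path.
    """
    with open(in_path) as src, \
         open(main_path, "w") as main, \
         open(minor_path, "w") as minor:

        def flush(cluster):
            # cluster[0] is the header line; the remaining entries are items.
            if not cluster:
                return
            if len(cluster) - 1 > max_minor_items:
                main.writelines(cluster)
                # Alternative: keep only the header and the first few items.
                # main.writelines(cluster[:max_minor_items + 1])
            else:
                minor.writelines(cluster)

        cluster = []
        for line in src:
            # Case-insensitive test, so ">Cluster 1" counts as a header.
            if "cluster" in line.lower():
                flush(cluster)  # a new header ends the previous cluster
                cluster = []
            cluster.append(line)
        flush(cluster)  # the last cluster has no header after it
```

Calling split_clusters('file.txt', 'mainoutput.txt', 'minoroutput.txt') would read the input once and write every cluster, in full, to one of the two files. Because each cluster is collected in a list before it is written, there is no need to index ahead with n+1, n+2, and n+3, so short clusters can’t pull in the start of the next one. One caveat of the substring test is that an item line that itself contains the word "cluster" would be treated as a header; if the headers always begin with ">", checking line.startswith('>') instead may be more robust.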