I’m working on a project to parse out unique words from a large number

Question


Asked: June 5, 2026


I’m working on a project to parse out unique words from a large number of text files. I’ve got the file handling down, but I’m trying to refine the parsing procedure. Each file has a specific text segment that ends with certain phrases that I’m catching with a regex on my live system.

The parser should walk through each line, and check each word against 3 criteria:

Longer than two characters
Not in a predefined dictionary set dict_file
Not already in the word list

The result should be a 2D array, each row a list of unique words per file, which is written to a CSV file using the .writerow(foo) method after each file is processed.

My working code is below, but it’s slow and kludgy. What am I missing?

My production system is running Python 2.5.1 with just the default modules (so NLTK is a no-go), and it can’t be upgraded to 2.7+.

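from __future__ import with_statement  # needed for "with" on Python 2.5
import string

# Assumed definition of punct: an identity translation table, since
# str.translate() on Python 2.5 does not accept None as the table
punct = string.maketrans("", "")

def process(line):
    # Strip the CR/LF, then delete every punctuation character
    line_strip = line.strip()
    return line_strip.translate(punct, string.punctuation)

# Directory walking and initialization here
report_set = set()
with open(fullpath, 'r') as report:
    for line in report:
        # Strip out the CR/LF and punctuation from the input line
        line_check = process(line)
        if line_check == "FOOTNOTES":
            break
        for word in line_check.split():
            word_check = word.lower()
            if ((word_check not in report_set) and (word_check not in dict_file)
                 and (len(word) > 2)):
                # Sets have no append(); add() is the set method
                report_set.add(word_check)
report_list = list(report_set)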

Edit: Updated my code based on steveha’s recommendations.


1 Answer

Editorial Team · Answer 1 · 2026-06-05T22:07:02+00:00

One problem is that a membership test (in) on a list is slow, because it scans the list one element at a time. You should probably keep a set to track the words you have seen, because a membership test on a set is a hash lookup and is very fast.

Example:

report_set = set()
for line in report:
    for word in line.split():
        if we_want_to_keep_word(word):
            report_set.add(word)

Then when you are done:
report_list = list(report_set)

You can convert a set to a list any time you need one. But if you only need to loop over it or run membership tests, you can leave it as a set; for x in report_set: works directly on the set.
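To get a feel for the list-versus-set difference, here is a rough timing sketch. The data is made up (50,000 numeric strings standing in for words), time_lookups is a helper invented for this illustration, and the absolute numbers will vary by machine; only the gap between the two matters.

```python
import time

def time_lookups(container, target, repeats):
    """Return (elapsed_seconds, found) for repeated membership tests."""
    found = False
    start = time.time()
    for _ in range(repeats):
        found = target in container
    return time.time() - start, found

# Hypothetical data: 50,000 distinct "words". The target sits at the end
# of the list, which is the worst case for a linear scan.
sample_words = [str(n) for n in range(50000)]
word_list = list(sample_words)
word_set = set(sample_words)
target = sample_words[-1]

list_time, list_found = time_lookups(word_list, target, 200)
set_time, set_found = time_lookups(word_set, target, 200)

print("list: %.4f s, set: %.4f s" % (list_time, set_time))
```

The list lookup has to walk the whole list each time, while the set lookup hashes the word once, so the gap grows as the word list grows.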

Another problem that might or might not matter is that you are slurping all the lines from the file in one go, using the .readlines() method. For really large files it is better to just use the open file-handle object as an iterator, like so:

with open("filename", "r") as f:
    for line in f:
        ... # process each line here

A big problem is that I don’t even see how this code can work:

while 1:
    lines = report.readlines()
    if not lines:
        break

The while 1 loop can only exit through the break. The first pass slurps all input lines with .readlines(). On the second pass, report is already exhausted, so .readlines() returns an empty list, which triggers the break. But by then the empty list has overwritten the lines we just read, and the rest of the code must make do with an empty lines variable. How does this even work?

So, get rid of that whole while 1 loop, and change the next loop to for line in report:.
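If you want to confirm the behavior of the second .readlines() call for yourself, a small sketch like this shows it. The temporary file and its two lines are made up for the demonstration.

```python
import os
import tempfile

# Hypothetical sample data written to a temporary file.
fd, path = tempfile.mkstemp()
os.close(fd)
handle = open(path, "w")
handle.write("first line\nsecond line\n")
handle.close()

report = open(path, "r")
first_read = report.readlines()   # consumes the whole file
second_read = report.readlines()  # nothing left to read: empty list
report.close()
os.remove(path)

print(first_read)
print(second_read)
```

The second call returns an empty list, so assigning it to the same variable throws away everything the first call read.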

Also, you don’t really need to keep a count variable. You can use len(report_set) at any time to find out how many words are in the set.

Also, with a set you don’t actually need to check whether a word is in the set; you can just always call report_set.add(word) and if it’s already in the set it won’t be added again!
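A tiny sketch of both points, using a made-up word sequence: duplicates are silently ignored by add(), and len() gives the count without a separate counter.

```python
report_set = set()

# Hypothetical input containing repeated words.
for word in ["alpha", "beta", "alpha", "gamma", "beta"]:
    report_set.add(word)  # adding a word that is already present does nothing

word_count = len(report_set)  # no separate count variable needed
print(word_count)
```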

Also, you don’t have to do it my way, but I like to make a generator that does all the processing. Strip the line, translate the line, split on whitespace, and yield up words ready to use. I would also force the words to lower-case except I don’t know whether it’s important that FOOTNOTES be detected only in upper-case.

So, put all the above together and you get:

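from __future__ import with_statement  # needed for "with" on Python 2.5
import string

# Identity table: passing None to str.translate() only works on Python 2.6+
identity = string.maketrans("", "")

def words(file_object):
    # Yield each whitespace-separated word with punctuation removed
    for line in file_object:
        line = line.strip().translate(identity, string.punctuation)
        for word in line.split():
            yield word

report_set = set()
with open(fullpath, 'r') as report:
    for word in words(report):
        if word == "FOOTNOTES":
            break
        word = word.lower()
        if len(word) > 2 and word not in dict_file:
            report_set.add(word)

print("Words in report_set: %d" % len(report_set))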
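For the CSV step you mentioned, something along these lines might work. This is only a sketch: write_word_rows and example_sets are names made up here, and sorting each row is my own choice to get stable output, not something you asked for.

```python
import csv
import sys

def write_word_rows(csv_file, word_sets):
    """Write one CSV row per set of words, sorted for stable output."""
    writer = csv.writer(csv_file)
    for report_set in word_sets:
        writer.writerow(sorted(report_set))

# Hypothetical per-file results standing in for the real report_set values.
example_sets = [set(["alpha", "beta"]), set(["gamma"])]
write_word_rows(sys.stdout, example_sets)
```

In your real loop you would create the writer once and call writerow() after each file is processed. On Python 2 the csv module expects the output file to be opened in binary mode ('wb').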
