The Archive Base Latest Questions

Editorial Team
Asked: June 5, 2026

I’m working on a project to parse out unique words from a large number

I’m working on a project to parse out unique words from a large number of text files. I’ve got the file handling down, but I’m trying to refine the parsing procedure. Each file has a specific text segment that ends with certain phrases that I’m catching with a regex on my live system.

The parser should walk through each line, and check each word against 3 criteria:

  1. Longer than two characters
  2. Not in a predefined dictionary set dict_file
  3. Not already in the word list

The result should be a 2D array, each row a list of unique words per file, which is written to a CSV file using the .writerow(foo) method after each file is processed.

My working code is below, but it’s slow and kludgy. What am I missing?

My production system is running Python 2.5.1 with just the default modules (so NLTK is a no-go), and it can’t be upgraded to 2.7+.

from __future__ import with_statement  # "with" needs this import on Python 2.5
import string

# Assumed definition: an identity table, since 2.5's str.translate() requires one
punct = string.maketrans("", "")

def process(line):
    # Strip surrounding whitespace (including CR/LF), then delete punctuation
    line_strip = line.strip()
    return line_strip.translate(punct, string.punctuation)

# Directory walking and initialization here
report_set = set()
with open(fullpath, 'r') as report:
    for line in report:
        line_check = process(line)
        if line_check == "FOOTNOTES":
            break
        for word in line_check.split():
            word_check = word.lower()
            # set.add() ignores duplicates, so no membership test is needed
            if len(word_check) > 2 and word_check not in dict_file:
                report_set.add(word_check)
report_list = list(report_set)

Edit: Updated my code based on steveha’s recommendations.

1 Answer

    Editorial Team
    Added an answer on June 5, 2026 at 10:07 pm

    One problem is that an in test for a list is slow. You should probably keep a set to keep track of what words you have seen, because the in test for a set is very fast.

    Example:

    report_set = set()
    for line in report:
        for word in line.split():
            if we_want_to_keep_word(word):
                report_set.add(word)
    

    Then when you are done:
    report_list = list(report_set)

    Anytime you need to force a set into a list, you can. But if you just need to loop over it or do in tests, you can leave it as a set; it’s legal to do for x in report_set:
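As a rough illustration of the list-versus-set point above, the following sketch times the same membership test against both containers. The word list, the probe value, and the repeat count are made up for the demonstration, and the absolute timings will vary by machine. It is written for a current Python interpreter, where timeit.timeit accepts a callable.

```python
import timeit

# Hypothetical data: 20,000 distinct words held in both container types.
words = ["word%d" % i for i in range(20000)]
as_list = list(words)
as_set = set(words)
probe = "word19999"  # last element, the worst case for a list scan

# Time 200 membership tests against each container.
list_seconds = timeit.timeit(lambda: probe in as_list, number=200)
set_seconds = timeit.timeit(lambda: probe in as_set, number=200)

print("list: %.6f s, set: %.6f s" % (list_seconds, set_seconds))
```

The list test has to walk the elements one by one, while the set test is a hash lookup, so the gap grows with the number of words collected.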

    Another problem that might or might not matter is that you are slurping all the lines from the file in one go, using the .readlines() method. For really large files it is better to just use the open file-handle object as an iterator, like so:

    with open("filename", "r") as f:
        for line in f:
            ... # process each line here
    

    A big problem is that I don’t even see how this code can work:

    while 1:
        lines = report.readlines()
        if not lines:
            break
    

    This loop runs exactly twice and throws the data away. The first pass slurps all input lines with .readlines(). On the second pass report is already exhausted, so .readlines() returns an empty list, which triggers the break. By then lines has been overwritten with that empty list, so the lines we just read are lost and the rest of the code must make do with an empty lines variable. How does this even work?

    So, get rid of that whole while 1 loop, and change the next loop to for line in report:.
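A small sketch of why a second .readlines() call comes back empty. The temporary file and its two lines are stand-ins for a real report, not anything from the original code.

```python
import os
import tempfile

# Write a small throwaway file to read back.
fd, path = tempfile.mkstemp(suffix=".txt")
os.write(fd, "alpha\nbeta\n".encode("ascii"))
os.close(fd)

report = open(path, "r")
try:
    first = report.readlines()   # reads every line up to end of file
    second = report.readlines()  # file is exhausted, so this is an empty list
finally:
    report.close()
    os.remove(path)

print(first)
print(second)
```

Assigning both results to the same name, as the while loop does, leaves only the empty list behind.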

    Also, you don’t really need to keep a count variable. You can use len(report_set) at any time to find out how many words are in the set.

    Also, with a set you don’t actually need to check whether a word is in the set; you can just always call report_set.add(word) and if it’s already in the set it won’t be added again!
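A minimal sketch of both points, using a made-up stream of words with repeats: calling add() on a word that is already present changes nothing, and len() gives the running count without a separate counter.

```python
# Hypothetical word stream containing repeats.
incoming = ["alpha", "beta", "alpha", "gamma", "beta"]

report_set = set()
for word in incoming:
    report_set.add(word)  # adding an existing member is a no-op

word_count = len(report_set)  # number of distinct words seen so far
print(word_count)
```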

    Also, you don’t have to do it my way, but I like to make a generator that does all the processing. Strip the line, translate the line, split on whitespace, and yield up words ready to use. I would also force the words to lower-case except I don’t know whether it’s important that FOOTNOTES be detected only in upper-case.

    So, put all the above together and you get:

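    from __future__ import with_statement  # "with" needs this import on Python 2.5
    import string

    # Identity table: Python 2.5's str.translate() does not accept None
    no_change = string.maketrans("", "")

    def words(file_object):
        # Yield punctuation-free words from each line of an open file
        for line in file_object:
            line = line.strip().translate(no_change, string.punctuation)
            for word in line.split():
                yield word

    report_set = set()
    with open(fullpath, 'r') as report:
        for word in words(report):
            if word == "FOOTNOTES":
                break
            word = word.lower()
            if len(word) > 2 and word not in dict_file:
                report_set.add(word)

    print("Words in report_set: %d" % len(report_set))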
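To connect this back to the CSV requirement in the question, here is one possible sketch of the per-file step followed by a .writerow() call. It is an illustration under assumptions, not the asker's actual code: the unique_words function, its stop_marker parameter, the sample lines, and the known_words set are all hypothetical, the rows are sorted only to make the output predictable, and FOOTNOTES is matched against a whole line as in the question. It targets a current Python 3 interpreter; on 2.5.1 the two-argument translate() form shown above would be needed instead.

```python
import csv
import io
import string

# Translation table that deletes ASCII punctuation (Python 3 form).
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def unique_words(lines, known_words, stop_marker="FOOTNOTES"):
    """Return lower-cased words longer than two characters that are not in
    known_words, reading lines until one equals stop_marker."""
    found = set()
    for line in lines:
        cleaned = line.strip().translate(_DELETE_PUNCT)
        if cleaned == stop_marker:
            break
        for word in cleaned.split():
            word = word.lower()
            if len(word) > 2 and word not in known_words:
                found.add(word)
    return found

# Hypothetical stand-ins for two report files and the dictionary set.
reports = [
    ["The quick zebra, the QUICK fox.\n", "FOOTNOTES\n", "ignored words\n"],
    ["An emu is a bird.\n"],
]
known_words = set(["the", "quick", "bird"])

out = io.StringIO()  # stands in for the open CSV file
writer = csv.writer(out)
for lines in reports:
    # One row per file, written as soon as that file has been processed.
    writer.writerow(sorted(unique_words(lines, known_words)))

csv_text = out.getvalue()
print(csv_text)
```

Because any iterable of lines works as input, an open file object can be passed to unique_words directly in place of the sample lists.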
    