I’m working on a project to parse out unique words from a large number of text files. I’ve got the file handling down, but I’m trying to refine the parsing procedure. Each file has a specific text segment that ends with certain phrases that I’m catching with a regex on my live system.
The parser should walk through each line, and check each word against 3 criteria:
- Longer than two characters
- Not in a predefined dictionary set, `dict_file`
- Not already in the word list
The result should be a 2D array, each row a list of unique words per file, which is written to a CSV file using the .writerow(foo) method after each file is processed.
My working code is below, but it’s slow and kludgy. What am I missing?
My production system is running Python 2.5.1 with just the default modules (so NLTK is a no-go), and it can’t be upgraded to 2.7+.
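    # Needed on Python 2.5 for the with statement
    from __future__ import with_statement
    import string

    # Identity translation table (assumed definition of punct);
    # Python 2.5 does not accept None as the table argument
    punct = string.maketrans("", "")

    def process(line):
        line_strip = line.strip()
        # Delete every punctuation character from the stripped line
        return line_strip.translate(punct, string.punctuation)

    # Directory walking and initialization here
    report_set = set()
    with open(fullpath, 'r') as report:
        for line in report:
            # Strip out the CR/LF and punctuation from the input line
            line_check = process(line)
            if line_check == "FOOTNOTES":
                break
            for word in line_check.split():
                word_check = word.lower()
                if ((word_check not in report_set) and (word_check not in dict_file)
                        and (len(word) > 2)):
                    # Sets have no .append(); .add() is the set method
                    report_set.add(word_check)
    report_list = list(report_set)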
Edit: Updated my code based on steveha’s recommendations.
One problem is that an `in` test for a `list` is slow. You should probably keep a `set` to keep track of what words you have seen, because the `in` test for a `set` is very fast. Example:
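A minimal sketch of the idea, using a made-up list of words in place of the file contents (the `words` data is purely illustrative):

```python
# Illustrative input; in the real parser these would come from the file.
words = ["alpha", "beta", "alpha", "gamma", "beta"]

report_set = set()
for word in words:
    # Membership tests on a set are hash lookups, so this stays fast
    # even when report_set holds many thousands of words.
    if word not in report_set:
        report_set.add(word)

print(sorted(report_set))
```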
Then when you are done:
    report_list = list(report_set)
Anytime you need to force a `set` into a `list`, you can. But if you just need to loop over it or do `in` tests, you can leave it as a `set`; it’s legal to do `for x in report_set:`.
Another problem that might or might not matter is that you are slurping all the lines from the file in one go, using the `.readlines()` method. For really large files it is better to just use the open file-handle object as an iterator, like so: `for line in report:`.
A big problem is that I don’t even see how this code can work:
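As a rough reconstruction of the pattern being described (an assumption about its shape, not the asker's exact code), with an in-memory `StringIO` object and made-up contents standing in for the open file:

```python
try:
    from StringIO import StringIO  # Python 2, including 2.5
except ImportError:
    from io import StringIO        # Python 3

# Stand-in for the open report file (hypothetical contents).
report = StringIO("first line\nsecond line\n")

passes = 0
while 1:
    lines = report.readlines()  # first pass: every line; second pass: []
    passes += 1
    if not lines:
        break                   # the only way out of the loop

# By now the lines read on the first pass have been overwritten.
print(passes, lines)
```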
This is an infinite loop that exits only through its `break`. The first pass slurps all input lines with `.readlines()`; on the second pass `report` is already exhausted, so `.readlines()` returns an empty list, which breaks out of the loop. But this has now lost all the lines we just read, and the rest of the code must make do with an empty `lines` variable. How does this even work?
So, get rid of that whole `while 1` loop, and change the next loop to `for line in report:`.
Also, you don’t really need to keep a `count` variable. You can use `len(report_set)` at any time to find out how many words are in the `set`.
Also, with a `set` you don’t actually need to check whether a word is `in` the set; you can just always call `report_set.add(word)` and if it’s already in the `set` it won’t be added again!
Also, you don’t have to do it my way, but I like to make a generator that does all the processing. Strip the line, translate the line, split on whitespace, and yield up words ready to use. I would also force the words to lower-case except I don’t know whether it’s important that `FOOTNOTES` be detected only in upper-case.
So, put all the above together and you get:
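A sketch of how that could look. The names `words_from` and `unique_words` are made up for illustration, `dict_file` is assumed to be a set of lower-case words, and punctuation is removed with a simple character filter rather than the two-argument `str.translate` from the question, so that the sketch runs on Python 3 as well as 2.5:

```python
import string

PUNCT = set(string.punctuation)

def words_from(lines):
    """Yield cleaned, lower-case words, stopping at the FOOTNOTES line."""
    for line in lines:
        # Drop surrounding whitespace (including CR/LF) and punctuation.
        cleaned = "".join([ch for ch in line.strip() if ch not in PUNCT])
        # Checked before lower-casing, so only upper-case FOOTNOTES matches.
        if cleaned == "FOOTNOTES":
            return
        for word in cleaned.split():
            yield word.lower()

def unique_words(lines, dict_file):
    """Collect words longer than two characters that are not in dict_file."""
    report_set = set()
    for word in words_from(lines):
        if len(word) > 2 and word not in dict_file:
            report_set.add(word)  # adding a duplicate is a no-op
    return report_set

# Any iterable of lines works, including an open file handle.
sample = ["The quick, brown fox!\n", "It is a FOX.\n",
          "FOOTNOTES\n", "ignored text\n"]
report_set = unique_words(sample, set(["the"]))
report_list = list(report_set)
print(len(report_set), sorted(report_list))
```

For a real file, the open handle can be passed in directly, as in `unique_words(report, dict_file)`, and the resulting list handed to `.writerow()`.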