I am running a classification/feature extraction task on a Windows server with 64 GB of RAM, and somehow Python thinks I am running out of memory:
misiti@fff /cygdrive/c/NaiveBayes
$ python run_classify_comments.py > tenfoldcrossvalidation.txt
Traceback (most recent call last):
  File "run_classify_comments.py", line 70, in <module>
    run_classify_comments()
  File "run_classify_comments.py", line 51, in run_classify_comments
    NWORDS = get_all_words("./data/HUGETEXTFILE.txt")
  File "run_classify_comments.py", line 16, in get_all_words
    def get_all_words(path): return words(file(path).read())
  File "run_classify_comments.py", line 15, in words
    def words(text): return re.findall('[a-z]+', text.lower())
  File "C:\Program Files (x86)\Python26\lib\re.py", line 175, in findall
    return _compile(pattern, flags).findall(string)
MemoryError
So the re module is running out of memory on a machine with 64 GB of RAM? I do not think so.
Why is this happening, and how can I configure Python to use all available RAM on my machine?
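Just rewrite your program to read your huge text file one line at a time. This is easily done by changing get_all_words(path) so that it calls words() on each line of the file and passes the results to sum() instead of reading the whole file at once.
Note the use of a generator expression inside the parentheses: it is lazy, so each line is only processed on demand as the sum function asks for it.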
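Here is a minimal sketch of what that change might look like. It assumes words() stays as shown in the traceback and that get_all_words() should still return a list. The name iter_all_words is a hypothetical addition for illustration and is not part of the original script.

```python
import re

WORD_RE = re.compile(r'[a-z]+')


def words(text):
    # Same behaviour as the one-liner in the traceback, applied to one line.
    return WORD_RE.findall(text.lower())


def get_all_words(path):
    # The generator expression is lazy: sum() pulls one line's word list
    # at a time, so the full file text is never held in memory at once.
    with open(path) as f:
        return sum((words(line) for line in f), [])


def iter_all_words(path):
    # Hypothetical variant (an assumption, not in the original script):
    # yield words one by one instead of building a single big list.
    with open(path) as f:
        for line in f:
            for word in words(line):
                yield word
```

Two caveats. The list returned by get_all_words() still holds every word, so this only avoids keeping the raw text and its lowercased copy in memory at the same time. Also, sum() over lists copies the accumulated list on every addition, which gets slow on a very large file; if the caller only needs to loop over the words once (for example, to count them), something like iter_all_words() is likely the better fit.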