Alright, I’ve been pretty interested in natural language processing recently; however, I’ve used C for most of my work until now. I’d heard of NLTK, and although I didn’t know Python, it seems quite easy to learn and looks like a really powerful and interesting language. In particular, the NLTK module seems very well suited to what I need to do.
However, when I paste sample code for NLTK into a file called test.py and run it, I’ve noticed it takes a very, very long time to run!
I’m calling it from the shell like so:
time python ./test.py
And on a 2.4 GHz machine with 4 GB of RAM, it takes 19.187 seconds!
Now, maybe this is absolutely normal, but I was under the impression that NLTK was extremely fast; I may have been mistaken, but is there anything obvious that I’m clearly doing wrong here?
I believe you’re conflating training time with processing time. Training a model, like a UnigramTagger, can take a lot of time. So can loading that trained model from a pickle file on disk. But once you have a model loaded into memory, processing can be quite fast. See the section called “Classifier Efficiency” at the bottom of my post on part of speech tagging with NLTK to get an idea of processing speed for different tagging algorithms.
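To see where the time goes, it can help to time each phase separately rather than timing the whole script from the shell. The sketch below is only an illustration: it uses a toy unigram tagger written with the standard library as a stand-in for NLTK's UnigramTagger, and the helper names (train_unigram, tag, timed) and the training sentences are made up for this example. The toy model is tiny, so its load time says nothing about how long a real NLTK pickle takes to load; the point is only the pattern of measuring training, loading, and tagging on their own.

```python
import pickle
import time
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag_label in sent:
            counts[word][tag_label] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, tokens):
    """Look up each token in the model; unknown words get None."""
    return [(tok, model.get(tok)) for tok in tokens]


def timed(fn, *args):
    """Call fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical toy training data, repeated so that training is measurable.
train_sents = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VBD")],
    [("the", "DT"), ("dog", "NN"), ("ran", "VBD")],
] * 50000

# Phase 1: training (paid once, usually the slow part).
model, train_secs = timed(train_unigram, train_sents)

# Phase 2: loading a previously pickled model (also paid once per run).
blob = pickle.dumps(model)
loaded, load_secs = timed(pickle.loads, blob)

# Phase 3: actual processing with the model already in memory.
tagged, tag_secs = timed(tag, loaded, ["the", "dog", "sat"])

print(f"train: {train_secs:.4f}s  load: {load_secs:.6f}s  tag: {tag_secs:.6f}s")
print(tagged)
```

If the same pattern is applied to test.py, with separate timers around the imports, the model construction or unpickling, and the tagging calls, it should become clear which phase accounts for most of the 19 seconds.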