I am trying to write a multicore version of MapReduce, just to see whether I have understood the concept. I also want to learn threading in Python.
Say I have two chunks of text.
How do I process them (for example, tokenize them into words) simultaneously using multiple threads?
I thought I understood the docs, but multithreaded programming is one area where one has to be very careful if the program is to be efficient.
Any suggestions?
I suggest you try the multiprocessing module and use the map() method of its Pool class. This will let you use multiple cores efficiently.
Python threading is not as efficient as it could be because of time-consuming locking within the Python interpreter (the global interpreter lock, or GIL). There is a threading module, but you are probably better off with the multiprocessing module for map/reduce sorts of problems.
Also, if you want to make sure you understand map/reduce, why not play with a real map/reduce system? Hadoop is a free-software map/reduce system, and it is possible to use Python with Hadoop:
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
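To make the multiprocessing suggestion concrete, here is a minimal sketch of a word count over two chunks of text. The helper names (map_chunk, merge_counts, word_count) are made up for illustration, and splitting on whitespace is only an assumption about what "tokenize" should mean here. The map step runs in worker processes via Pool.map(), and the reduce step merges the partial counts in the parent process.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool


def map_chunk(chunk):
    """Map step (hypothetical helper): lowercase one chunk, split it on
    whitespace, and count the resulting words."""
    return Counter(chunk.lower().split())


def merge_counts(left, right):
    """Reduce step (hypothetical helper): combine two partial word counts."""
    return left + right


def word_count(chunks, processes=2):
    """Run the map step in worker processes, then reduce in the parent."""
    with Pool(processes=processes) as pool:
        # Each chunk is sent to a worker; results come back in input order.
        partial_counts = pool.map(map_chunk, chunks)
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    # The guard matters: on platforms that spawn worker processes, the main
    # module is re-imported in each worker.
    chunks = ["the quick brown fox", "the lazy dog and the fox"]
    print(word_count(chunks))
```

For two short strings the cost of starting processes and pickling data will likely outweigh any gain; the approach should only pay off when each chunk is large enough to keep a core busy for a while.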