I have some serial code like this that computes word concordances, i.e. it counts collocated word pairs. The following program works, except that the list of sentences is canned for illustrative purposes.

    import sys
    from collections import defaultdict

    GLOBAL_CONCORDANCE = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: [])))

    def BuildConcordance(sentences):
        global GLOBAL_CONCORDANCE
        for sentenceIndex, sentence in enumerate(sentences):
            words = [word for word in sentence.split()]
            for index, word in enumerate(words):
                for i, collocate in enumerate(words[index:len(words)]):
                    GLOBAL_CONCORDANCE[word][collocate][i].append(sentenceIndex)

    def main():
        sentences = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4"]
        BuildConcordance(sentences)
        print GLOBAL_CONCORDANCE

    if __name__ == "__main__":
        main()

To me, the first for loop looks parallelizable because the values computed for each sentence are independent. However, the data structure being modified is a global one.
I tried using Python’s multiprocessing Pool, but I am running into pickling problems, which makes me wonder whether I am using the right design pattern. Can someone suggest a good way to parallelize this code?
In general, multiprocessing is easiest when you use a functional style. In this case, my suggestion would be to return a list of result tuples from each instance of the worker function. The extra complexity of the nested defaultdicts doesn’t really gain you anything. Something like this:

Let me know if that doesn’t generate what you’re looking for.
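A minimal sketch of the tuple-returning approach described in this answer, offered as one way it could look rather than a definitive implementation. The names `find_collocations`, `merge_results`, and `build_concordance` are hypothetical, and the flat `(word, collocate, offset)` key is an assumed replacement for the nested defaultdicts. The worker is a plain top-level function with no lambdas and no globals, so its arguments and results can be pickled when passed between processes.

```python
from collections import defaultdict
from multiprocessing import Pool


def find_collocations(indexed_sentence):
    """Worker: return (word, collocate, offset, sentence_index) tuples for one sentence."""
    sentence_index, sentence = indexed_sentence
    words = sentence.split()
    results = []
    for index, word in enumerate(words):
        # offset 0 pairs the word with itself, as in the original loop
        for offset, collocate in enumerate(words[index:]):
            results.append((word, collocate, offset, sentence_index))
    return results


def merge_results(per_sentence):
    """Combine the per-sentence tuple lists into one dict (runs in the parent only)."""
    concordance = defaultdict(list)
    for results in per_sentence:
        for word, collocate, offset, sentence_index in results:
            concordance[(word, collocate, offset)].append(sentence_index)
    return dict(concordance)


def build_concordance(sentences, processes=None):
    """Map sentences across worker processes, then merge in the parent."""
    pool = Pool(processes)
    try:
        # pool.map returns results in input order
        per_sentence = pool.map(find_collocations, list(enumerate(sentences)))
    finally:
        pool.close()
        pool.join()
    return merge_results(per_sentence)


if __name__ == "__main__":
    sentences = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4"]
    print(build_concordance(sentences))
```

Only the parent process touches the final dictionary, so no shared global state is needed. For very short sentences, the cost of sending work to other processes may outweigh the gain, so passing a `chunksize` to `pool.map` is worth trying.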