I’m trying to call a Java program (the Stanford Chinese Word Segmenter) from within Python. The Java program needs to load a large (100M) dictionary file (a word list that assists segmentation), which takes 12+ seconds. Is it possible to speed up the loading process and, more importantly, to avoid loading the file repeatedly when I need to call the Python script multiple times?
Here’s the relevant part of the code:
op = subprocess.Popen(['java',
                       '-mx2g',
                       '-cp',
                       'seg.jar',
                       'edu.stanford.nlp.ie.crf.CRFClassifier',
                       '-sighanCorporaDict',
                       'data',
                       '-testFile',
                       filename,
                       '-inputEncoding',
                       'utf-8',
                       '-sighanPostProcessing',
                       'true',
                       'ctb',
                       '-loadClassifier',
                       './data/ctb.gz',
                       '-serDictionary',
                       './data/dict-chris6.ser.gz',
                       '0'],
                      stdout=subprocess.PIPE,
                      stdin=subprocess.PIPE,
                      stderr=subprocess.STDOUT,
                      )
In the above code, ‘./data/ctb.gz’ is where the large word list file is loaded. I think the solution might have something to do with how processes work, but I don’t know much about that.
If the Java program produces output as soon as it receives input from the named pipe passed as filename, and you can’t change the Java program, then you could keep your Python script running instead and communicate with it via files/sockets, as @DNA suggested for the Java process (the same idea, but the Python program keeps running).
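A minimal sketch of the "keep it running" idea, under stated assumptions: the child process is started once, and the long-lived Python program sends it one line at a time and reads one line back. The names PersistentWorker and DEMO_CHILD are hypothetical, and the stand-in child here is a tiny Python echo program, not the segmenter. Whether the Java segmenter actually answers line by line and flushes its output promptly is an assumption you would need to check before swapping in the real command.

```python
import subprocess
import sys


class PersistentWorker:
    """Start a child process once and reuse it for many requests."""

    def __init__(self, command):
        # The expensive start-up (for example, loading a large dictionary)
        # is paid only once, here.
        self.proc = subprocess.Popen(
            command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            encoding="utf-8",
            bufsize=1,  # line-buffered on the Python side
        )

    def request(self, line):
        # Assumption: the child replies to each input line with exactly one
        # output line and flushes it immediately. If it does not, this blocks.
        self.proc.stdin.write(line.rstrip("\n") + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        # Closing stdin signals end of input so the child can exit.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


# Hypothetical stand-in child for demonstration: upper-cases each input line.
DEMO_CHILD = [
    sys.executable, "-u", "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n",
]

if __name__ == "__main__":
    worker = PersistentWorker(DEMO_CHILD)
    for text in ["first request", "second request"]:
        print(worker.request(text))
    worker.close()
```

The Python program that owns the worker has to stay alive between requests, so other scripts would talk to it through a file or socket rather than starting it afresh each time.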