I am working on a project that requires the use of collocations. I have written the following code to extract them. The code takes a string and returns a list of the collocation patterns found in that string. I used the Stanford POS tagger to do the tagging.
I would like your suggestions on the code; it seems very slow when I process a huge amount of text.
Any suggestions to improve the code would be highly appreciated.
/**
*
* A COLLOCATION is an expression consisting of two or more words that
* correspond to some conventional way of saying things.
*
* I used the seven part-of-speech tag patterns for collocation filtering that
* were suggested by Justeson and Katz (1995).
* These patterns are:
*
* -----------------------------------------
* |Tag Pattern | Example |
* -----------------------------------------
* |AN | linear function |
* |NN | regression coefficients |
* |AAN | Gaussian random variable |
* |ANN | cumulative distribution function |
* |NAN | mean squared error |
* |NNN | class probability function |
* |NPN | degrees of freedom |
* -----------------------------------------
* Where A=adjective, P=preposition, & N=noun.
*
* The Stanford POS tagger was used for the extraction process.
* see: http://nlp.stanford.edu/software/tagger.shtml#Download
*
* more on collocation: http://nlp.stanford.edu/fsnlp/promo/colloc.pdf
* more on POS: http://acl.ldc.upenn.edu/J/J93/J93-2004.pdf
*
*/
public class GetCollocations {
    public static ArrayList<String> GetCollocations(String text) throws IOException, ClassNotFoundException{
        MaxentTagger tagger = new MaxentTagger("taggers/wsj-0-18-left3words.tagger");
        String[] tagged = tagger.tagString(text).split("\\s+");
        ArrayList<String> collocations = new ArrayList();
        for (int i = 0; i < tagged.length; i++) {
            String pot = tagged[i].substring(tagged[i].indexOf("_") + 1);
            if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                pot = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);
                if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                    collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]));
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                } else if (pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS")) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                } else if (pot.equals("IN")) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                }
            } else if (pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS")) {
                pot = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);
                if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                    collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]));
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                } else if (pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS")) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                }
            }
        }
        return collocations;
    }
    public static String GetWordWithoutTag(String wordWithTag){
        String wordWithoutTag = wordWithTag.substring(0,wordWithTag.indexOf("_"));
        return wordWithoutTag;
    }
}
If you are processing anywhere near 15,000 words per second then you are maxing out with the POS tagger. According to the Stanford POS tagger FAQ:
The rest of your algorithm appears fine, though if you really want to squeeze some juice out of it, you could pre-allocate an array as a static class variable instead of using the ArrayList. Essentially, you pay the memory cost up front so that you do not have to instantiate the ArrayList on each call or pay for the occasional O(n) resize when adding elements (each individual add is amortized O(1)).
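If a static array feels too invasive, a middle ground is to keep the ArrayList but give it an initial capacity, so the backing array never has to grow. What follows is only a rough sketch of that variation, not the code from the question: the class name PresizedCollocationList and the helpers wordOf, tagOf, isNoun and nounBigrams are made up for illustration, it covers only the NN pattern, and it works on an already tagged string so that it runs without the tagger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: presizing the result list, shown for the NN pattern only.
public class PresizedCollocationList {

    // Returns the word part of a token such as "function_NN".
    // Assumption: the tag follows the LAST underscore in the token.
    public static String wordOf(String taggedToken) {
        int sep = taggedToken.lastIndexOf('_');
        return sep < 0 ? taggedToken : taggedToken.substring(0, sep);
    }

    // Returns the tag part of a token such as "function_NN".
    public static String tagOf(String taggedToken) {
        return taggedToken.substring(taggedToken.lastIndexOf('_') + 1);
    }

    // NN, NNS, NNP and NNPS all start with "NN".
    public static boolean isNoun(String tag) {
        return tag.startsWith("NN");
    }

    public static List<String> nounBigrams(String[] tagged) {
        // Each position yields at most one bigram here, so tagged.length is an
        // upper bound and the list never needs to resize.
        List<String> result = new ArrayList<>(tagged.length);
        // Stopping at length - 1 keeps tagged[i + 1] inside the array.
        for (int i = 0; i + 1 < tagged.length; i++) {
            if (isNoun(tagOf(tagged[i])) && isNoun(tagOf(tagged[i + 1]))) {
                result.add(wordOf(tagged[i]) + " " + wordOf(tagged[i + 1]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tagged =
            "the_DT regression_NN coefficients_NNS are_VBP small_JJ".split("\\s+");
        System.out.println(nounBigrams(tagged)); // prints [regression coefficients]
    }
}
```

With all seven patterns a position can yield two entries (a bigram and a trigram), so the capacity bound would need to be roughly twice the token count. Whether this is measurable next to the tagger's own cost is something you would have to profile.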
Also, just a suggestion on improving the readability of the code: you may consider using some private methods for checking what part of speech the pot variable is. Also, if I'm not mistaken, you should be able to simplify what you are doing by combining some of the if statements. This won't really speed up your code, but it will make it nicer to work with. Please go through this carefully; I have just tried to simplify your logic to demonstrate my point. Keep in mind the code below is UNTESTED: