The Archive Base

Editorial Team
Asked: June 14, 2026

I am working on a project that requires the use of collocations. I have written the following code to extract them. It takes a string and returns a list of the collocation patterns found in that string. I use the Stanford POS tagger to do the tagging.

I would like suggestions on the code: it seems very slow when I process a huge amount of text.
Any suggestions to improve it would be highly appreciated.

/**
*
*  A COLLOCATION is an expression consisting of two or more words that
*  correspond to some conventional way of saying things.
* 
*  I used the seven part-of-speech tag patterns for collocation filtering that
*  were suggested by Justeson and Katz (1995).
*  These patterns are:
* 
*  -----------------------------------------
*  |Tag |     Pattern Example              |
*  -----------------------------------------
*  |AN  | linear function                  |
*  |NN  | regression coefficients          |
*  |AAN | Gaussian random variable         |
*  |ANN | cumulative distribution function |
*  |NAN | mean squared error               |
*  |NNN | class probability function       |
*  |NPN | degrees of freedom               |                     
*  -----------------------------------------
*  Where A=adjective, P=preposition, & N=noun.
* 
*  The Stanford POS tagger has been used for the extraction process.
*  see: http://nlp.stanford.edu/software/tagger.shtml#Download
* 
*  more on collocation:    http://nlp.stanford.edu/fsnlp/promo/colloc.pdf
*  more on POS:            http://acl.ldc.upenn.edu/J/J93/J93-2004.pdf
*  
*/

import java.io.IOException;
import java.util.ArrayList;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class GetCollocations {
    public static ArrayList<String> GetCollocations(String text) throws IOException, ClassNotFoundException {
        MaxentTagger tagger = new MaxentTagger("taggers/wsj-0-18-left3words.tagger");
        String[] tagged = tagger.tagString(text).split("\\s+");

        ArrayList<String> collocations = new ArrayList<>();
        for (int i = 0; i < tagged.length; i++) {

            String pot = tagged[i].substring(tagged[i].indexOf("_") + 1);
            if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {

                pot = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);
                if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {

                    collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]));

                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }

                } else if (pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS")) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);

                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }

                } else if (pot.equals("IN")) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);

                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }

                }

            } else if (pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS")) {
                pot = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);
                if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                    collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]));
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }

                } else if (pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS")) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                }

            }

        }
        return collocations;
    }

    public static String GetWordWithoutTag(String wordWithTag) {
        String wordWithoutTag = wordWithTag.substring(0, wordWithTag.indexOf("_"));
        return wordWithoutTag;
    }

}
1 Answer

  1. Editorial Team
    Added an answer on June 14, 2026 at 7:17 am

    If you are processing anywhere near 15,000 words per second, then you are already maxing out the POS tagger. According to the Stanford POS tagger FAQ:

    on a 2008 nothing-special Intel server, it tags about 15000 words per second
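
    To check how close you are to that figure, you could time the tagging step on its own. The following is only a rough sketch: ThroughputCheck, wordsPerSecond, and the stand-in tagger in main are hypothetical names, not part of the Stanford library. In real use you would pass a function that calls your own tagger, and a single run like this gives a ballpark number rather than a careful benchmark.

```java
import java.util.function.Function;

class ThroughputCheck {
    /** Returns words per second for a single call of tagFn on text. */
    static double wordsPerSecond(Function<String, String> tagFn, String text) {
        String trimmed = text.trim();
        int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        long start = System.nanoTime();
        tagFn.apply(text);
        // Guard against a zero elapsed time on very short inputs.
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        return words * 1e9 / elapsedNanos;
    }

    public static void main(String[] args) {
        // Stand-in for the real tagger call (assumption): appends a dummy tag.
        Function<String, String> fakeTagger = s -> s.replaceAll("(\\S+)", "$1_NN");
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            text.append("linear function regression coefficients ");
        }
        System.out.printf("%.0f words/s%n", wordsPerSecond(fakeTagger, text.toString()));
    }
}
```

    If the number you get for the tagger alone is close to your overall throughput, the collocation loop is not the bottleneck.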
    

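    One thing worth checking first: the method constructs a new MaxentTagger, and so reloads the model file, on every call. If you call it once per sentence or document, creating the tagger once and reusing it should help. The rest of your algorithm appears fine, though if you really want to squeeze some juice out of it you could pre-allocate an array as a static class variable instead of using the ArrayList. Essentially, you would pay the memory cost up front to avoid instantiating the ArrayList on each call and the occasional O(n) copy when it has to grow (adding an element is amortized O(1)).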
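
    A middle ground between a static array and a default ArrayList is to give the ArrayList an initial capacity, so it never has to grow. The sketch below is illustrative only: PresizedResults and adjacentPairs are hypothetical names, and it collects plain adjacent word pairs rather than collocations. In the extractor itself each position adds at most one bigram and one trigram, so a capacity of twice the token count would be a safe upper bound.

```java
import java.util.ArrayList;
import java.util.List;

class PresizedResults {
    /**
     * Collects every adjacent word pair. The list is created with room for
     * tokens.length entries, so add() never triggers a resize here.
     */
    static List<String> adjacentPairs(String[] tokens) {
        List<String> pairs = new ArrayList<>(tokens.length);
        for (int i = 0; i + 1 < tokens.length; i++) {
            pairs.add(tokens[i] + " " + tokens[i + 1]);
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[] tokens = {"mean", "squared", "error"};
        System.out.println(adjacentPairs(tokens)); // prints [mean squared, squared error]
    }
}
```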

    Also, as a suggestion for improving the readability of the code, you might consider using some private methods to check which part of speech the pot variable holds:

    private static boolean _isNoun(String pot) {
        // Penn Treebank noun tags: singular, plural, proper singular, proper plural.
        return pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS");
    }

    private static boolean _isAdjective(String pot) {
        // Penn Treebank adjective tags: base, comparative, superlative.
        return pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS");
    }
    

    Also, if I'm not mistaken, you should be able to simplify what you are doing by combining some of the if statements. This won't really speed up your code, but it will make it nicer to work with. Please go through this carefully; I have only tried to simplify your logic to demonstrate my point. Keep in mind that the code below is UNTESTED:

    public static ArrayList<String> GetCollocations(String text) throws IOException, ClassNotFoundException {
        MaxentTagger tagger = new MaxentTagger("taggers/wsj-0-18-left3words.tagger");
        String[] tagged = tagger.tagString(text).split("\\s+");
        ArrayList<String> collocations = new ArrayList<>();

        // Stop one token early so that tagged[i + 1] is always in bounds.
        for (int i = 0; i + 1 < tagged.length; i++) {
            String first = tagged[i].substring(tagged[i].indexOf("_") + 1);
            if (!_isNoun(first) && !_isAdjective(first)) {
                continue;
            }
            String second = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);
            String pair = GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]);

            // Bigrams AN and NN.
            if (_isNoun(second)) {
                collocations.add(pair);
            }

            // Trigrams AAN, ANN, NAN, NNN, plus NPN when the first word is a noun.
            boolean middleOk = _isNoun(second) || _isAdjective(second)
                    || (second.equals("IN") && _isNoun(first));
            if (middleOk && i + 2 < tagged.length) {
                String third = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                if (_isNoun(third)) {
                    collocations.add(pair + " " + GetWordWithoutTag(tagged[i + 2]));
                }
            }
        }
        return collocations;
    }
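
    Another way to shorten the branching, offered only as a sketch, is to map each tag to one letter (A, N, or P) and look the resulting two- or three-letter string up in a set of the seven patterns. PatternCollocations, tagClass, and extract are hypothetical names. The sketch takes already tagged text in word_TAG form so that it runs without the tagger, and it assumes Penn Treebank tags, where every tag starting with NN is a noun tag and every tag starting with JJ is an adjective tag. Unlike the code above, it splits each token at the last underscore rather than the first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PatternCollocations {
    // The seven patterns listed in the question's comment block.
    private static final Set<String> PATTERNS = new HashSet<>(
            Arrays.asList("AN", "NN", "AAN", "ANN", "NAN", "NNN", "NPN"));

    /** Maps a Penn Treebank tag to A, N, or P, or to '-' for anything else. */
    static char tagClass(String tag) {
        if (tag.startsWith("NN")) return 'N'; // NN, NNS, NNP, NNPS
        if (tag.startsWith("JJ")) return 'A'; // JJ, JJR, JJS
        if (tag.equals("IN")) return 'P';
        return '-';
    }

    /** Expects whitespace-separated tokens of the form word_TAG. */
    static List<String> extract(String taggedText) {
        List<String> result = new ArrayList<>();
        String trimmed = taggedText.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        String[] words = new String[tokens.length];
        char[] classes = new char[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            int sep = tokens[i].lastIndexOf('_');
            words[i] = sep < 0 ? tokens[i] : tokens[i].substring(0, sep);
            classes[i] = sep < 0 ? '-' : tagClass(tokens[i].substring(sep + 1));
        }
        for (int i = 0; i < tokens.length; i++) {
            // Try the bigram first, then the trigram, staying inside the array.
            for (int len = 2; len <= 3 && i + len <= tokens.length; len++) {
                if (PATTERNS.contains(new String(classes, i, len))) {
                    result.add(String.join(" ", Arrays.copyOfRange(words, i, i + len)));
                }
            }
        }
        return result;
    }
}
```

    For example, extract("degrees_NNS of_IN freedom_NN") returns a list holding the single entry "degrees of freedom". Adding or removing a pattern then means editing the set rather than the control flow.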
    