Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6640543
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 25, 20262026-05-25T23:41:05+00:00 2026-05-25T23:41:05+00:00

I am reading about 600 text files, and then parsing each file individually and

  • 0

I am reading about 600 text files, and then parsing each file individually and add all the terms to a map so i can know the frequency of each word within the 600 files. (about 400MB).

My parser functions includes the following steps (ordered):

  • find text between two tags, which is the relevant text to read in each file.
  • lowecase all the text
  • string.split with multiple delimiters.
  • creating an arrayList with words like this: “aaa-aa”, then adding to the string splitted above, and discounting “aaa” and “aa” to the String []. (i did this because i wanted “-” to be a delimiter, but i also wanted “aaa-aa” to be one word only, and not “aaa” and “aa”.
  • get the String [] and map to a Map = new HashMap … (word, frequency)
  • print everything.

It is taking me about 8min and 48 seconds, in a dual-core 2.2GHz, 2GB Ram. I would like advice on how to speed this process up. Should I expect it to be this slow? And if possible, how can I know (in netbeans), which functions are taking more time to execute?

unique words found: 398752.

CODE:

File file = new File(dir);
String[] files = file.list();

for (int i = 0; i < files.length; i++) {
    BufferedReader br = new BufferedReader(
        new InputStreamReader(
            new BufferedInputStream(
                new FileInputStream(dir + files[i])), encoding));
    try {
        String line;
        while ((line = br.readLine()) != null) {
            parsedString = parseString(line); // parse the string
            m = stringToMap(parsedString, m);
        }
    } finally {
        br.close();
    }
}

EDIT: Check this:

![enter image description here][1]

I don’t know what to conclude.


EDIT: 80% TIME USED WITH THIS FUNCTION

    public String [] parseString(String sentence){
         // separators; ,:;'"\/<>()[]*~^ºª+&%$ etc..
        String[] parts = sentence.toLowerCase().split("[,\\s\\-:\\?\\!\\«\\»\\'\\´\\`\\\"\\.\\\\\\/()<>*º;+&ª%\\[\\]~^]");

        Map<String, String> o = new HashMap<String, String>(); // save the hyphened words, aaa-bbb like Map<aaa,bbb>

        Pattern pattern = Pattern.compile("(?<![A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû-])[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+-[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+(?![A-Za-z-])");
        Matcher matcher = pattern.matcher(sentence);

    // Find all matches like this: ("aaa-bb or bbb-cc") and put it to map to later add this words to the original map and discount the single words "aaa-aa" like "aaa" and "aa"
        for(int i=0; matcher.find(); i++){
           String [] tempo = matcher.group().split("-");
           o.put(tempo[0], tempo[1]);
        }
        //System.out.println("words: " + o);


        ArrayList temp = new ArrayList();
        temp.addAll(Arrays.asList(parts));

        for (Map.Entry<String, String> entry : o.entrySet()) {
            String key = entry.getKey();
            String value = entry.getValue();
            temp.add(key+"-"+value);
            if(temp.indexOf(key)!=-1){
                temp.remove(temp.indexOf(key));
            }
            if(temp.indexOf(value)!=-1){
                temp.remove(temp.indexOf(value));
            }
        }


        String []strArray = new String[temp.size()];
        temp.toArray(strArray);
                return strArray;

  }

600 files, each file about 0.5MB

EDIT3#- The pattern is no longer compiling each time a line is read. The new images are:

enter image description here

2: enter image description here

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-25T23:41:06+00:00Added an answer on May 25, 2026 at 11:41 pm

    Be sure to increase your heap size, if you haven’t already, using -Xmx. For this app, the impact may be striking.

    The parts of your code that are likely to have the largest performance impact are the ones that are executed the most – which are the parts you haven’t shown.

    Update after memory screenshot

    Look at all those Pattern$6 objects in the screenshot. I think you’re recompiling the pattern a lot – maybe for every line. That would take a lot of time.

    Update 2 – after code added to question.

    Yup – two patterns compiled on every line – the explicit one, and also the “-” in the split (much cheaper, of course). I wish they hadn’t added split() to String without it taking a compiled pattern as an argument. I see some other things that could be improved, but nothing else like the big compile. Just compile the pattern once, outside this function, maybe as a static class member.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

After reading about Mapper XMLs I can't help to wonder how one might go
While reading about wireless technology, I usually meet the term provisioning. Anyone can help
Reading about both separatedly, looks like the same, html+xml+javascript. What's the difference between then?
While reading about shell scripts and temporary file handling, I came across Symlink Exploits.
I have about 50-60 pdf files (images) that are 1.5MB large each. Now I
After reading about how file based PHP sessions are not the greatest for performance,
While reading about binary semaphore and mutex I found the following difference: Both can
While reading about memmove I read that it can handle MEMORY OVERLAPS but I
While reading about printf(),I found that it can print numbers as positive or negative
I am reading a text file into a database through EF4. This file has

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.