I am reading about 600 text files, and then parsing each file individually and

Question

Asked: May 25, 2026 (2026-05-25T23:41:05+00:00)

I am reading about 600 text files (about 400MB in total), parsing each file individually, and adding all the terms to a map so I can find the frequency of each word across the 600 files.

My parser function includes the following steps (in order):

1. Find the text between two tags, which is the relevant text to read in each file.
2. Lowercase all the text.
3. Call string.split with multiple delimiters.
4. Create an ArrayList of hyphenated words such as “aaa-aa”, add them to the String[] produced by the split above, and remove the separate “aaa” and “aa” entries from it. (I did this because I wanted “-” to be a delimiter, but I also wanted “aaa-aa” to count as one word, not as “aaa” and “aa”.)
5. Take the String[] and add its words to a Map = new HashMap … (word, frequency).
6. Print everything.

It takes about 8 minutes and 48 seconds on a dual-core 2.2GHz machine with 2GB of RAM. I would like advice on how to speed this process up. Should I expect it to be this slow? And, if possible, how can I find out (in NetBeans) which functions take the most time to execute?

unique words found: 398752.

CODE:

    File file = new File(dir);
    String[] files = file.list();

    for (int i = 0; i < files.length; i++) {
        BufferedReader br = new BufferedReader(
            new InputStreamReader(
                new BufferedInputStream(
                    new FileInputStream(dir + files[i])), encoding));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                parsedString = parseString(line); // parse the string
                m = stringToMap(parsedString, m);
            }
        } finally {
            br.close();
        }
    }

EDIT: Check this:

[Screenshot: memory profile (image not available)]

I don’t know what to conclude.

EDIT: 80% of the time is spent in this function:

    public String [] parseString(String sentence){
         // separators; ,:;'"\/<>()[]*~^ºª+&%$ etc..
        String[] parts = sentence.toLowerCase().split("[,\\s\\-:\\?\\!\\«\\»\\'\\´\\`\\\"\\.\\\\\\/()<>*º;+&ª%\\[\\]~^]");

        Map<String, String> o = new HashMap<String, String>(); // save the hyphened words, aaa-bbb like Map<aaa,bbb>

        Pattern pattern = Pattern.compile("(?<![A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû-])[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+-[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+(?![A-Za-z-])");
        Matcher matcher = pattern.matcher(sentence);

        // Find all hyphenated matches (e.g. "aaa-bb" or "bbb-cc") and store them in the map,
        // so they can later be added as whole words while the single words
        // ("aaa" and "bb") are discounted.
        while (matcher.find()) {
            String[] tempo = matcher.group().split("-");
            o.put(tempo[0], tempo[1]);
        }
        //System.out.println("words: " + o);

        ArrayList<String> temp = new ArrayList<String>(Arrays.asList(parts));

        for (Map.Entry<String, String> entry : o.entrySet()) {
            String key = entry.getKey();
            String value = entry.getValue();
            temp.add(key + "-" + value);
            if (temp.indexOf(key) != -1) {
                temp.remove(temp.indexOf(key));
            }
            if (temp.indexOf(value) != -1) {
                temp.remove(temp.indexOf(value));
            }
        }

        String[] strArray = new String[temp.size()];
        temp.toArray(strArray);
        return strArray;
    }

600 files, each file about 0.5MB

EDIT 3: The pattern is no longer compiled each time a line is read. The new screenshots are:

[Screenshot 1: image not available]

[Screenshot 2: image not available]

1 Answer

Editorial Team · Answer 1 · 2026-05-25T23:41:06+00:00

Be sure to increase your heap size, if you haven’t already, using -Xmx. For this app, the impact may be striking.
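
To confirm that the setting actually took effect, the program can print the limit the JVM reports at startup. This is only a minimal sketch: the class name `HeapCheck` is made up for illustration, and the `-Xmx1g` in the usage note is an example value, not a recommendation for your machine.

```java
// Sketch: report the maximum heap the running JVM will try to use.
final class HeapCheck {

    /** Maximum heap size in whole megabytes, as reported by the JVM. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as, for example, `java -Xmx1g HeapCheck` and again without the flag should show roughly how much room the JVM has in each case.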

The parts of your code that are likely to have the largest performance impact are the ones that are executed the most – which are the parts you haven’t shown.
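
If the profiler output is hard to interpret, a crude alternative is to accumulate wall-clock time per stage of the read loop. The sketch below assumes nothing about your real code: `StageTimer` is a hypothetical helper, and the two statements inside the loop are stand-ins for your `parseString` and `stringToMap` calls.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: accumulate elapsed time per named stage using System.nanoTime().
final class StageTimer {
    private final Map<String, Long> nanosByStage = new HashMap<>();

    /** Adds the time elapsed since startNanos to the named stage's running total. */
    void record(String stage, long startNanos) {
        nanosByStage.merge(stage, System.nanoTime() - startNanos, Long::sum);
    }

    /** Total time recorded for a stage, in milliseconds (0 if never recorded). */
    long totalMillis(String stage) {
        return nanosByStage.getOrDefault(stage, 0L) / 1_000_000L;
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        String[] lines = {"some sample text", "more sample text"}; // stand-in for lines read from the files
        int words = 0;
        for (String line : lines) {
            long t0 = System.nanoTime();
            String[] parts = line.split(" ");   // stand-in for parseString(line)
            timer.record("parse", t0);

            long t1 = System.nanoTime();
            words += parts.length;              // stand-in for stringToMap(parts, m)
            timer.record("count", t1);
        }
        System.out.println("words: " + words);
        System.out.println("parse ms: " + timer.totalMillis("parse"));
        System.out.println("count ms: " + timer.totalMillis("count"));
    }
}
```

This is far less precise than a real profiler, but over 600 files the totals should be large enough to show which stage dominates.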

Update after memory screenshot

Look at all those Pattern$6 objects in the screenshot. I think you’re recompiling the pattern a lot – maybe for every line. That would take a lot of time.

Update 2 – after code added to question.

Yup – two patterns compiled on every line – the explicit one, and also the “-” in the split (much cheaper, of course). I wish they hadn’t added split() to String without it taking a compiled pattern as an argument. I see some other things that could be improved, but nothing else like the big compile. Just compile the pattern once, outside this function, maybe as a static class member.
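
Here is a rough sketch of what I mean by compiling once. The class name `LineParser` and both expressions are simplified stand-ins I made up for illustration, not your actual separator list or hyphen pattern. Note that the long character-class expression you pass to `String.split()` is, as far as I know, also compiled on each call, so it is worth hoisting as well and calling `Pattern.split()` on the compiled instance instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineParser {
    // Compiled once when the class is loaded, then reused for every line.
    // Both expressions are simplified stand-ins for the ones in the question.
    private static final Pattern SEPARATORS = Pattern.compile("[,\\s:;.!?()]+");
    private static final Pattern HYPHENATED = Pattern.compile("\\p{L}+-\\p{L}+");

    /** Splits a lowercased line with the precompiled pattern (Pattern.split, not String.split). */
    static String[] splitWords(String line) {
        return SEPARATORS.split(line.toLowerCase());
    }

    /** Collects hyphenated words such as "aaa-bb" with the precompiled pattern. */
    static List<String> hyphenatedWords(String line) {
        List<String> found = new ArrayList<>();
        Matcher matcher = HYPHENATED.matcher(line.toLowerCase());
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "Hello, world: a well-known example";
        System.out.println(String.join("|", splitWords(line)));
        System.out.println(hyphenatedWords(line));
    }
}
```

`Pattern` instances are immutable and safe to share, so a `static final` field is a reasonable home for them; only the `Matcher` has to be created per line.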
