I'm building a frequency dictionary, for which I read 1000 files, each with about 1000 lines. The approach I'm following is:
- use a BufferedReader to read file by file
- read the first file, get the first sentence, split the sentence into a String array, then fill a HashMap with the values from the array
- do this for all the sentences in that file
- do this for all 1000 files
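The steps above could look roughly like this in Java (8 or later, because of `Map.merge`). This is only a sketch: the class and method names are made up for illustration, and splitting on whitespace is an assumption, since the post does not say which delimiter is used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical name; illustrates the line-by-line approach described in the list.
class LineByLineCount {
    // Reads the input one line at a time, splits each line on whitespace
    // (assumed delimiter), and counts every word in the shared map.
    static void countWords(BufferedReader reader, Map<String, Integer> freq) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {          // a blank line yields one empty token
                    freq.merge(word, 1, Integer::sum);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> freq = new HashMap<>();
        // In-memory stand-in for one file; a real run would open a FileReader per file.
        try (BufferedReader reader = new BufferedReader(new StringReader("a b a\nb c\n"))) {
            countWords(reader, freq);
        }
        System.out.println(freq);
    }
}
```

The same map is passed in for every file, so the counts accumulate across all 1000 files.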
My problem is that this is not very efficient: it takes about 4 minutes to do all this. I've increased the heap size and refactored the code to make sure I'm not doing something wrong. For this approach, I'm completely sure there's nothing more I can improve in the code.
My bet is that a split is applied each time a sentence is read, which, multiplied by 1000 sentences per file and by 1000 files, is a huge amount of splits to process.
My idea is that, instead of reading and splitting each file sentence by sentence, I could read each file into a char array and then do the split only once per file. That would reduce the time spent on splitting. Any suggestions for an implementation would be appreciated.
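A minimal sketch of that idea, under some assumptions: the whole file is read into a `String` rather than a char array, the files are UTF-8, and words are separated by whitespace (which also covers line breaks). The class and method names are hypothetical. Whether this is actually faster than the line-by-line version would have to be measured.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical name; illustrates "one split per file".
class WholeFileCount {
    // Splits the complete file content in a single call and counts the words.
    static void countWords(String content, Map<String, Integer> freq) {
        for (String word : content.trim().split("\\s+")) {
            if (!word.isEmpty()) {              // an empty file yields one empty token
                freq.merge(word, 1, Integer::sum);
            }
        }
    }

    // Loads the whole file into memory (UTF-8 is an assumption) and counts it.
    static void countFile(Path file, Map<String, Integer> freq) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        countWords(content, freq);
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> freq = new HashMap<>();
        for (String name : args) {              // file names given on the command line
            countFile(Paths.get(name), freq);
        }
        System.out.println(freq);
    }
}
```

With files of about 1000 lines, holding one file in memory at a time should not be a problem for the heap.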
OK, I have just implemented a POC of your dictionary, quick and dirty. My files had 868 lines each, and there were 1024 of them, all copies of the same file. (It is the table of contents of the Spring Framework documentation.)
I ran my test and it took 14020 ms (14 seconds!). BTW, I ran it from Eclipse, which could slow it down a little.
So I do not know where your problem is. Please try my code on your machine, and if it runs faster, compare it with your code to find the root of the problem.
Anyway, my code is not the fastest I could write.
I could create a Pattern before the loop and then use it instead of String.split(). String.split() generally calls Pattern.compile() on every call (newer JDKs skip this only for simple single-character delimiters), and creating a pattern is expensive.
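The Pattern idea on its own, separate from the POC, might look like the sketch below. The class name, the method name and the whitespace regex are assumptions made for illustration; `Pattern.compile` and `Pattern.split` are the standard `java.util.regex` calls.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical name; shows a regex compiled once and reused for every line.
class PrecompiledSplit {
    // Compiled a single time when the class is loaded.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Splits one line with the shared pattern and counts its words.
    static void countLine(String line, Map<String, Integer> freq) {
        for (String word : WHITESPACE.split(line.trim())) {
            if (!word.isEmpty()) {              // a blank line yields one empty token
                freq.merge(word, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        countLine("to be or not to be", freq);
        System.out.println(freq);
    }
}
```

`countLine` would be called from the read loop in place of `line.split(...)`; nothing else in the loop has to change.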
Here is the code: