I have to resolve a problem close to parsing a huge file like, 3


Asked: May 18, 2026 (2026-05-18T09:12:32+00:00)


I need to solve a problem that involves parsing a huge file, 3 GB or larger. The file is structured like a pseudo-XML file:

<docFileNo_1>
<otherItems></otherItems>

<html>
<div=XXXpostag>
</html>

</docFileNo>
   ... others doc... 
<docFileNo_N>
<otherItems></otherItems>

<html>
<div=XXXpostag>
</html>

</docFileNo>

Searching the web, I read about other people who had trouble handling files this large; the usual suggestion is to memory-map the file with NIO.
I think that solution is too expensive and could end up throwing an exception. So my problem comes down to two questions:

1. How to read the 3 GB text file efficiently in terms of time.
2. How to efficiently parse the HTML extracted from each docFileNoxx and apply rules to the HTML tags to extract the post from the tag.

So far I have tried to solve the first question this way:

_reader = new BufferedReader(new FileReader(filePath)); // create a buffered reader for the file
_currentLine = _reader.readLine(); // I iterate over the file, reading it line by line

For every line, I append the line to a String variable until I encounter the tag.
Then, with JSOUP and the post CSS filter, I extract the content and write it to a file.

Extracting 25 MB takes about 88 seconds on average, so I would like to improve its performance.

How can I speed up my extraction?


1 Answer

Editorial Team · Answer 1 · 2026-05-18T09:12:32+00:00

Whatever you do, don’t do (pseudo code):

String data = "";
for line in file {
    data += line;
}

but use a StringBuilder, because += on a String copies the whole accumulated text on every iteration:

StringBuilder data = new StringBuilder();
for line in file {
    data.append(line);
}
return data.toString();
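
As a rough Java sketch of the StringBuilder loop above: the class name LineAccumulator and the method readAll are made up for illustration, and a newline is appended after each line because readLine() strips it (the pseudo code does not do this). It takes any Reader, so a FileReader for the real file can be passed in the same way as the StringReader used here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of the original answer.
class LineAccumulator {

    // Reads every line from the source and collects them in one StringBuilder.
    static String readAll(Reader source) {
        StringBuilder data = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() drops the line terminator, so put one back.
                data.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return data.toString();
    }

    public static void main(String[] args) {
        String text = readAll(new StringReader("<html>\n<div=XXXpostag>\n</html>"));
        System.out.print(text);
    }
}
```

Note that this still holds the whole input in memory, so for a 3 GB file it only makes sense per document, not for the file as a whole.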

Further, consider walking through the file and creating a map that holds only the interesting parts.
I assume you don’t have real XML but something that only looks a bit like it, and that the example you gave is a fair representation of the content.

Map<String, String> entries = new HashMap<String,String>(1000);
StringBuilder entryData = null;
for line in file {
  if line starts with "<docFileNo" {
     docFileNo = extract number from line;
  } else if line starts with "<div=XXXpostag>" {
     // Content of this entry starts here
     entryData = new StringBuilder();
  } else if line starts with "</html>" {
     // content of this entry ends here
     // so store content, and indicate that the entry is finished by 
     // setting data to null
     entries.put(docFileNo, entryData.toString());
     entryData = null;
  } else if entryData is not null {
     // we're in an entry as data is not null, so store the line
     entryData.append(line);
  }
}

The map contains only entry-sized strings, which makes them a bit easier to handle. You would need to adapt it to the real data, but this is something you could test in about half an hour.

The key is entryData. It is not only the StringBuilder in which the data of one entry is built: when it is non-null it also indicates that we have seen the start-of-entry marker (the div), and when it is null we have seen the end marker (</html>), so the following lines need not be stored.

I assumed you want to keep the doc number, and that XXXpostag is constant.
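
The pseudo code above could look roughly like this in Java. This is only a sketch under the same assumptions: the markers are taken literally from the sample in the question, the doc number is assumed to follow the form <docFileNo_N>, and the names EntryCollector and collect are invented for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; marker strings are assumptions based on the sample data.
class EntryCollector {
    private static final Pattern DOC_START = Pattern.compile("^<docFileNo_(\\d+)>");
    private static final String ENTRY_START = "<div=XXXpostag>";
    private static final String ENTRY_END = "</html>";

    static Map<String, String> collect(Reader source) {
        Map<String, String> entries = new LinkedHashMap<>();
        String docFileNo = null;
        StringBuilder entryData = null; // non-null only while inside an entry
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher docStart = DOC_START.matcher(line);
                if (docStart.find()) {
                    docFileNo = docStart.group(1);
                } else if (line.startsWith(ENTRY_START)) {
                    entryData = new StringBuilder(); // content starts here
                } else if (line.startsWith(ENTRY_END)) {
                    // Content ends here; store it only if an entry was open.
                    if (entryData != null && docFileNo != null) {
                        entries.put(docFileNo, entryData.toString());
                    }
                    entryData = null;
                } else if (entryData != null) {
                    entryData.append(line).append('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    public static void main(String[] args) {
        String sample = "<docFileNo_1>\n<html>\n<div=XXXpostag>\nfirst post\n</html>\n</docFileNo>\n";
        System.out.println(collect(new StringReader(sample)));
    }
}
```

For the real 3 GB file, it may be preferable to process each entry as soon as its </html> line is reached instead of keeping all of them in the map.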

An alternative implementation of this logic could be made using the Scanner class.
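
One possible shape for the Scanner variant, again only as a sketch: it assumes every document ends with a literal </docFileNo> line, as in the sample, and uses that as the Scanner delimiter so that each token is one whole document. The names ScannerEntryCollector and collect are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; marker strings are assumptions based on the sample data.
class ScannerEntryCollector {
    private static final Pattern DOC_START = Pattern.compile("<docFileNo_(\\d+)>");
    private static final String ENTRY_START = "<div=XXXpostag>";
    private static final String ENTRY_END = "</html>";

    static Map<String, String> collect(Readable source) {
        Map<String, String> entries = new LinkedHashMap<>();
        try (Scanner scanner = new Scanner(source)) {
            // Each token is everything up to the next closing </docFileNo>.
            scanner.useDelimiter("</docFileNo>\\s*");
            while (scanner.hasNext()) {
                String block = scanner.next();
                Matcher docStart = DOC_START.matcher(block);
                int start = block.indexOf(ENTRY_START);
                int end = start < 0 ? -1 : block.indexOf(ENTRY_END, start);
                // Skip blocks that lack a doc number or either marker.
                if (docStart.find() && end >= 0) {
                    String content = block.substring(start + ENTRY_START.length(), end);
                    entries.put(docStart.group(1), content.trim());
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        String sample = "<docFileNo_1>\n<html>\n<div=XXXpostag>\nfirst post\n</html>\n</docFileNo>\n";
        System.out.println(collect(new StringReader(sample)));
    }
}
```

Scanner tokenises with regular expressions, so it may well be slower than BufferedReader.readLine() on a file this size; it is worth measuring both before committing to one.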
