I need to solve a problem that involves parsing a huge file, 3 GB or larger. The file is structured as a pseudo-XML file, like this:
<docFileNo_1>
<otherItems></otherItems>
<html>
<div=XXXpostag>
</html>
</docFileNo>
... others doc...
<docFileNo_N>
<otherItems></otherItems>
<html>
<div=XXXpostag>
</html>
</docFileNo>
Searching the net, I have read about people who ran into problems handling files this large; the usual suggestion is to map the file with NIO.
But I think that solution is too expensive and could end up throwing an exception. So I think my problem comes down to two doubts:
- How to read the 3 GB text file efficiently in time.
- How to efficiently parse the HTML extracted from each docFileNoxx, and apply rules to the HTML tags to extract the post from the tag.
So, I have tried to solve the first question this way:
- _reader = new BufferedReader(new FileReader(filePath)) // create a buffered reader on the file
- _currentLine = _reader.readLine(); // I iterate over the file, reading it line by line
- For every line, I append the line to a String variable until I encounter the tag
- Then, with JSOUP and the post CSS filter, I extract the content and write it to a file.
Extracting 25 MB takes about 88 seconds on average.
So I would like to speed it up.
How can I improve the performance of my extraction?
Whatever you do, don’t build up the text by repeatedly concatenating to a String inside the read loop;
use a StringBuilder instead.
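As a rough illustration of the difference (a sketch only; the class and method names LineJoiner, joinSlow and joinFast are made up for this example and are not from the question), compare appending to a String with appending to a StringBuilder:

```java
import java.util.List;

final class LineJoiner {
    // Avoid: each += allocates a new String and copies everything
    // accumulated so far, so the total work grows quadratically.
    static String joinSlow(List<String> lines) {
        String result = "";
        for (String line : lines) {
            result += line + "\n";
        }
        return result;
    }

    // Prefer: StringBuilder appends into one growing buffer,
    // so the total work stays roughly linear in the input size.
    static String joinFast(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
```

Both methods return the same text; only the cost differs, and the gap widens quickly as the number of lines per document grows.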
Further, consider walking through the file and creating a map with only the interesting parts.
I assume you don’t have XML but something which only looks a bit like it, and the example you gave is a fair representation of the content.
The map contains only entry-sized strings which makes them a bit easier to handle. I think you’d need to adapt it to the true data, but this is something which you could test in about half an hour.
The key is entryData. It is not only the StringBuilder in which the data of one entry is built; when it is non-null it also indicates that we saw a start-of-entry marker (the div), and when it is null that we saw the end marker (</html>), indicating that the following lines need not be stored. I assumed you want to keep the doc number, and that the XXXpostag marker is constant.
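A minimal sketch of that logic could look like the following. The marker strings are copied from the sample in the question and are assumptions about the real data; the class name EntryMapper and the method collect are hypothetical, and the sketch assumes each marker sits on its own line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

final class EntryMapper {
    // Assumed markers, taken from the sample; adapt them to the true data.
    static final String DOC_PREFIX = "<docFileNo_";
    static final String ENTRY_START = "<div=XXXpostag>";
    static final String ENTRY_END = "</html>";

    // Returns doc number -> text found between the start and end markers.
    static Map<String, String> collect(BufferedReader reader) throws IOException {
        Map<String, String> entries = new LinkedHashMap<>();
        String docNumber = null;
        StringBuilder entryData = null; // non-null only while inside an entry
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.startsWith(DOC_PREFIX)) {
                // "<docFileNo_17>" -> "17"
                int end = trimmed.indexOf('>');
                docNumber = trimmed.substring(DOC_PREFIX.length(),
                        end < 0 ? trimmed.length() : end);
            } else if (trimmed.startsWith(ENTRY_START)) {
                entryData = new StringBuilder(); // start marker seen
            } else if (trimmed.startsWith(ENTRY_END)) {
                if (entryData != null) {
                    entries.put(docNumber, entryData.toString());
                    entryData = null; // end marker seen: stop storing lines
                }
            } else if (entryData != null) {
                entryData.append(line).append('\n');
            }
        }
        return entries;
    }
}
```

It would be called with something like new BufferedReader(new FileReader(filePath)). For a 3 GB input the map itself may grow large, so handing each entry to the HTML parser as soon as the end marker is seen, instead of storing it, is worth considering.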
An alternative implementation of this logic could be made using the Scanner class.
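One possible shape for the Scanner variant, again only a sketch under the same assumptions about the markers (ScannerEntryMapper and collect are hypothetical names): use the closing </docFileNo> tag as the delimiter, so that each token is one whole document, and cut the entry out with indexOf.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Pattern;

final class ScannerEntryMapper {
    // Assumed markers, taken from the sample; adapt them to the true data.
    static final String DOC_PREFIX = "<docFileNo_";
    static final String DOC_END = "</docFileNo>";
    static final String ENTRY_START = "<div=XXXpostag>";
    static final String ENTRY_END = "</html>";

    // Returns doc number -> trimmed text between the start and end markers.
    static Map<String, String> collect(Readable source) {
        Map<String, String> entries = new LinkedHashMap<>();
        try (Scanner scanner = new Scanner(source)) {
            // Each token is everything up to the next closing doc tag.
            scanner.useDelimiter(Pattern.quote(DOC_END));
            while (scanner.hasNext()) {
                String doc = scanner.next();
                int numberStart = doc.indexOf(DOC_PREFIX);
                int entryStart = doc.indexOf(ENTRY_START);
                if (numberStart < 0 || entryStart < 0) {
                    continue; // trailing whitespace or a doc without an entry
                }
                numberStart += DOC_PREFIX.length();
                int numberEnd = doc.indexOf('>', numberStart);
                entryStart += ENTRY_START.length();
                int entryEnd = doc.indexOf(ENTRY_END, entryStart);
                if (numberEnd < 0 || entryEnd < 0) {
                    continue; // malformed doc: skip it
                }
                entries.put(doc.substring(numberStart, numberEnd),
                        doc.substring(entryStart, entryEnd).trim());
            }
        }
        return entries;
    }
}
```

Scanner has to buffer one whole document per token, so this only makes sense if individual documents are of moderate size; the line-by-line BufferedReader approach is likely to be faster on very large inputs.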