The Archive Base

Editorial Team
Asked: May 18, 2026


I have to solve a problem that involves parsing a huge file, 3 GB or larger. The file is structured like a pseudo-XML file:

<docFileNo_1>
<otherItems></otherItems>

<html>
<div=XXXpostag>
</html>

</docFileNo>
   ... others doc... 
<docFileNo_N>
<otherItems></otherItems>

<html>
<div=XXXpostag>
</html>

</docFileNo>

Searching the web, I have read about people who ran into problems handling files this large; the usual suggestion is to map the file with NIO.
I think that solution is too expensive and could end up throwing an exception. So my problem comes down to two questions:

  1. How to read the 3 GB text file efficiently in terms of time.
  2. How to efficiently parse the HTML extracted from each docFileNoxx, and apply rules to the HTML tags to extract the post from the tag.

I have tried to solve the first question this way:

  1. _reader = new BufferedReader(new FileReader(filePath)); // create a buffered reader on the file
  2. _currentLine = _reader.readLine(); // iterate over the file, reading it line by line
  3. For every line, I append the line to a String variable until I encounter the tag.
  4. Then, with JSOUP and a CSS filter for the post, I extract the content and write it to a file.

Extracting 25 MB takes about 88 seconds on average, so I would like to improve its performance.

How could I speed up the extraction?

1 Answer

  1. Editorial Team
    Added an answer on May 18, 2026 at 9:12 am

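    Whatever you do, don’t do this:

    String data = "";
    String line;
    while ((line = reader.readLine()) != null) {   // reader is your BufferedReader
        data += line;   // copies everything read so far on every iteration
    }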
    

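    but use a StringBuilder:

    StringBuilder data = new StringBuilder();
    String line;
    while ((line = reader.readLine()) != null) {   // reader is your BufferedReader
        data.append(line);   // appends in place, no repeated copying
    }
    return data.toString();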
    

    Further, consider walking through the file and building a map that holds only the interesting parts.
    I assume you don’t have real XML but something that only looks a bit like it, and that the example you gave is a fair representation of the content.

    Map<String, String> entries = new HashMap<>(1000);
    String docFileNo = null;          // number of the document currently being read
    StringBuilder entryData = null;   // non-null only while inside an entry
    String line;
    while ((line = reader.readLine()) != null) {   // reader is your BufferedReader
      if (line.startsWith("<docFileNo")) {
         // Assumes the "<docFileNo_N>" form: take the text between '_' and '>'
         docFileNo = line.substring(line.indexOf('_') + 1, line.indexOf('>'));
      } else if (line.startsWith("<div=XXXpostag>")) {
         // Content of this entry starts here
         entryData = new StringBuilder();
      } else if (line.startsWith("</html>")) {
         // Content of this entry ends here, so store it and mark the
         // entry as finished by setting entryData to null
         if (entryData != null) {
            entries.put(docFileNo, entryData.toString());
         }
         entryData = null;
      } else if (entryData != null) {
         // We're inside an entry because entryData is not null, so store the line
         entryData.append(line);
      }
    }
    

    The map contains only entry-sized strings, which makes them a bit easier to handle. You will probably need to adapt it to the real data, but this is something you could test in about half an hour.

    The key is entryData. It is not only the StringBuilder in which the data of one entry is built: when it is non-null, it also indicates that we saw the start-of-entry marker (the div), and when it is null, that we saw the end marker (</html>), so the following lines need not be stored.
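The role of entryData as both buffer and "inside an entry" flag can be tried out with a small self-contained sketch. The class name EntryExtractor, the method extract, and the sample lines between the markers are hypothetical, and the code assumes the `<docFileNo_N>` form from the question. Unlike the map version, this sketch hands each finished entry to a callback as soon as `</html>` is seen, which is one possible way to avoid holding all entries of a 3 GB file in memory at once; it also keeps a newline between the stored lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

class EntryExtractor {
    private static final String DOC_PREFIX = "<docFileNo_";

    /**
     * Reads the source line by line and passes each (docNumber, content)
     * pair to the handler when the closing </html> line is reached.
     */
    static void extract(Reader source, BiConsumer<String, String> handler) throws IOException {
        String docFileNo = null;
        StringBuilder entryData = null; // non-null only between the div marker and </html>
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(DOC_PREFIX)) {
                    // Assumption: the line looks like "<docFileNo_N>"
                    int end = line.indexOf('>');
                    if (end > DOC_PREFIX.length()) {
                        docFileNo = line.substring(DOC_PREFIX.length(), end);
                    }
                } else if (line.startsWith("<div=XXXpostag>")) {
                    entryData = new StringBuilder(); // start marker seen
                } else if (line.startsWith("</html>")) {
                    if (entryData != null) {
                        handler.accept(docFileNo, entryData.toString());
                    }
                    entryData = null; // end marker seen, stop storing lines
                } else if (entryData != null) {
                    entryData.append(line).append('\n');
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample in the shape of the question's file
        String sample = String.join("\n",
                "<docFileNo_1>", "<otherItems></otherItems>", "<html>",
                "<div=XXXpostag>", "first post", "</html>", "</docFileNo>",
                "<docFileNo_2>", "<html>", "<div=XXXpostag>", "second", "post",
                "</html>", "</docFileNo>");
        Map<String, String> entries = new LinkedHashMap<>();
        extract(new StringReader(sample), entries::put); // collect into a map, as in the answer
        System.out.println(entries);
    }
}
```

For the real file, the StringReader would be replaced by a FileReader on the file path, and the handler could write each entry straight to the output file instead of putting it in a map.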

    I assumed you want to keep the doc number, and that XXXpostag is constant.

    An alternative implementation of this logic could be made using the Scanner class.
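One way the Scanner variant might look is sketched below. It is only an illustration: the class name ScannerExtractor is hypothetical, the doc number is assumed to be numeric, and the input is assumed to follow the `<docFileNo_N>` ... `<div=XXXpostag>` ... `</html>` layout from the question. The sketch uses `</html>` as the Scanner delimiter, so each token ends exactly where an entry ends.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScannerExtractor {
    private static final String START = "<div=XXXpostag>";
    // Assumption: doc numbers are digits, as in "<docFileNo_12>"
    private static final Pattern DOC_NO = Pattern.compile("<docFileNo_(\\d+)>");

    static Map<String, String> extract(Readable source) {
        Map<String, String> entries = new LinkedHashMap<>();
        try (Scanner scanner = new Scanner(source)) {
            scanner.useDelimiter("</html>"); // every token stops at an end marker
            while (scanner.hasNext()) {
                String chunk = scanner.next();
                Matcher docNo = DOC_NO.matcher(chunk);
                int start = chunk.indexOf(START);
                // Skip chunks without both a doc number and a start marker,
                // for example the trailing "</docFileNo>" after the last entry
                if (docNo.find() && start >= 0) {
                    String content = chunk.substring(start + START.length()).trim();
                    entries.put(docNo.group(1), content);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Made-up sample in the shape of the question's file
        String sample = String.join("\n",
                "<docFileNo_1>", "<html>", "<div=XXXpostag>", "first post", "</html>", "</docFileNo>",
                "<docFileNo_2>", "<html>", "<div=XXXpostag>", "second post", "</html>", "</docFileNo>");
        System.out.println(extract(new StringReader(sample)));
    }
}
```

Scanner matches its delimiter with regular expressions and is often slower than a plain BufferedReader loop, so it is worth timing both on a slice of the real file before choosing.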
