The Archive Base


Editorial Team
Asked: May 16, 2026 (2026-05-16T00:12:38+00:00)


I’m working on a web crawler (please don’t suggest an existing one; it’s not an option).
It works as expected. My only issue is that I’m currently using a sort of server/client model, whereby the server does the crawling and processes the data, then puts it in a central location.

This location is an object created from a class I wrote. Internally, the class maintains a HashMap declared as HashMap<String, HashMap<String, String>>.

I store data in the map with the URL as the key (I keep these unique), and the inner HashMap holds the data fields for that URL, such as title or value.

I occasionally serialize the internal objects used, but the spider is multithreaded, and as soon as I have, say, 5 threads crawling, the memory requirements go up exponentially.
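For readers following along, here is a minimal sketch of the kind of store described above. The class name `CrawlStore` and its methods are hypothetical, since the question does not show the real class, and the sketch assumes `ConcurrentHashMap` in place of the plain `HashMap` so that several crawler threads can write safely. Everything still lives in memory until `saveTo` is called, which is the limitation the question is about.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical class; the original implementation is not shown in the question.
final class CrawlStore {
    // URL -> (field name -> value), e.g. "title" -> "Example".
    private final ConcurrentHashMap<String, ConcurrentHashMap<String, String>> pages =
            new ConcurrentHashMap<>();

    // Records one field for a URL. Neither the field nor the value may be null.
    void put(String url, String field, String value) {
        pages.computeIfAbsent(url, k -> new ConcurrentHashMap<>()).put(field, value);
    }

    // Returns the stored value, or null if the URL or field is unknown.
    String get(String url, String field) {
        Map<String, String> fields = pages.get(url);
        return fields == null ? null : fields.get(field);
    }

    // Number of distinct URLs currently held in memory.
    int size() {
        return pages.size();
    }

    // Writes a copy of the current contents to a file using Java serialization.
    void saveTo(Path file) throws IOException {
        HashMap<String, HashMap<String, String>> copy = new HashMap<>();
        pages.forEach((url, fields) -> copy.put(url, new HashMap<>(fields)));
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(copy);
        }
    }

    // Reads a file written by saveTo back into a new store.
    @SuppressWarnings("unchecked")
    static CrawlStore loadFrom(Path file) throws IOException, ClassNotFoundException {
        CrawlStore store = new CrawlStore();
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            Map<String, Map<String, String>> copy =
                    (Map<String, Map<String, String>>) in.readObject();
            copy.forEach((url, fields) ->
                    fields.forEach((field, value) -> store.put(url, field, value)));
        }
        return store;
    }
}
```

A crawler thread would call something like `store.put(url, "title", title)` for each field it extracts, and a separate task would call `saveTo` from time to time.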

So far, performance with the HashMap has been excellent: 15K URLs crawled in 2.r minutes with about 30 seconds of CPU time. So I really don’t need to be pointed toward an existing spider, as most forum users have suggested.

Can anyone suggest a fast disk-based solution that will support concurrent reading and writing? The data structure doesn’t have to be the same; it just needs to store related values, such as meta tag values, together.


1 Answer

  1. Editorial Team
    Added an answer on May 16, 2026 at 12:12 am

    I suggest using Ehcache for this, even though what you’re building isn’t really a cache. Ehcache lets you configure a cache instance so that it overflows to disk storage while keeping the most recent items in memory. It can also be configured to be disk-persistent, i.e. data is flushed to disk on shutdown and read back into memory at startup. On top of all that, it’s key-value based, so it already fits your model. It supports concurrent access, and since the disk storage is managed by a separate thread, you shouldn’t need to worry about disk access concurrency.
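    As a rough illustration of the configuration described above, an Ehcache 2.x-style `ehcache.xml` might look like the fragment below. The cache name `crawlData`, the disk store path, and the in-memory limit are assumptions chosen for the example, not values from the question. Newer Ehcache releases configure persistence differently, so check the documentation for the version you use.

    ```xml
    <ehcache>
      <!-- Directory for overflow and persistent data files (assumed location). -->
      <diskStore path="java.io.tmpdir"/>

      <!-- Defaults for any cache created without its own configuration. -->
      <defaultCache maxElementsInMemory="1000"
                    eternal="true"
                    overflowToDisk="false"/>

      <!-- Hypothetical cache for crawl results, keyed by URL. -->
      <cache name="crawlData"
             maxElementsInMemory="10000"
             eternal="true"
             overflowToDisk="true"
             diskPersistent="true"
             memoryStoreEvictionPolicy="LRU"/>
    </ehcache>
    ```

    Here `eternal="true"` keeps entries from expiring, `overflowToDisk` moves entries beyond the in-memory limit to the disk store, and `diskPersistent` keeps the disk data across restarts. Keys and values must be Serializable to be written to disk, which a HashMap<String, String> per URL already is.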

    Alternatively, you could consider a proper embedded database such as Hypersonic (HSQLDB) or one of the many others in a similar style, but that’s probably going to be more work.

