I have written a web crawler in Java, and I am using Berkeley DB

Question

Asked: June 3, 2026 (2026-06-03T07:06:08+00:00)

I have written a web crawler in Java, and I am using Berkeley DB to save the pages I crawl (for later indexing, etc.). I am storing each page as a Webpage object, which has the following instance fields:

@PrimaryKey
String url;
String docString;
Date lastVisited;
Date lastChecked;
ArrayList<String> stringLinks;

The largest field is the String docString, which holds the entire HTML content (normally not more than 500 KB, even for a huge page). stringLinks keeps a String for each of the outbound links on the page. That shouldn’t be large: at most it’s 100 strings of length ~70, or about 7 KB in total.

I crawl a little faster than one page per second, sometimes 2 pages per second, and I am seeing the Berkeley DB database grow by about 2-3 MB per page, which is absolutely crazy given the data stored. The database stores the Webpages in an EntityStore, and I sync it periodically. No matter what I change, I can’t get the disk usage to go down!
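To put the numbers above in perspective, here is a rough back-of-the-envelope sketch. The class and method names are hypothetical, and it assumes about 1 byte per character for the serialized strings, which is only an approximation; the two Date fields are ignored as negligible.

```java
/** Rough estimate of per-page payload versus observed on-disk growth. */
class PageSizeEstimate {

    /**
     * Approximate payload bytes for one page: the HTML plus the outbound links.
     * Assumes roughly 1 byte per character (an approximation, not a JE guarantee).
     */
    static long payloadBytes(long htmlBytes, int linkCount, int avgLinkLength) {
        return htmlBytes + (long) linkCount * avgLinkLength;
    }

    /** How many times larger the observed growth is than the estimated payload. */
    static double overheadFactor(long observedBytes, long payloadBytes) {
        return (double) observedBytes / payloadBytes;
    }
}
```

With the worst-case figures from the question, payloadBytes(500 * 1024, 100, 70) comes to 519,000 bytes, so growth of 2-3 MB per page is roughly 4 to 6 times the largest expected record, and far more for typical pages.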

This is a pretty big problem, because if I run multiple instances of the crawler (I have built it to be distributed) they will each quickly use a ton of disk space. If the growth is linear, I might be fine, but I have no way to tell what function the space is growing by. All I know is that it is many times the size of the actual data.

Is there something I am missing about EntityStore?

One thing to note is that I am both reading from and writing to the DB, so I can’t set any flags to make it write-only or anything like that. I would also prefer not to increase the cache size much, since this is a heap-space-sensitive environment.

1 Answer

Editorial Team · Answer 1 · 2026-06-03T07:06:10+00:00

The issue was with deferred write. I had to enable deferred write and call env.sync() on a timer, rather than calling env.sync() on each put, to keep the DB size in check. The size decreased by a factor of more than 30…
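The timer-based sync described above could be sketched roughly as follows. This is a minimal illustration, not the original poster's code: PeriodicSyncer is a hypothetical helper built only on the JDK, and the sync action is passed in as a Runnable so the sketch stays independent of the database library. With Berkeley DB JE, the assumption is that the store is opened with StoreConfig.setDeferredWrite(true) and that the action passed in is env::sync (as in the answer) or store::sync.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Runs a sync action on a fixed timer instead of after every put.
 * Hypothetical helper; with Berkeley DB JE the action would typically be
 * env::sync or store::sync on a store opened in deferred-write mode.
 */
class PeriodicSyncer implements AutoCloseable {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private final Runnable syncAction;

    PeriodicSyncer(Runnable syncAction, long period, TimeUnit unit) {
        this.syncAction = syncAction;
        // Fixed delay: the next sync is scheduled only after the previous one ends.
        timer.scheduleWithFixedDelay(this::safeSync, period, period, unit);
    }

    private void safeSync() {
        try {
            syncAction.run();
        } catch (RuntimeException e) {
            // An uncaught exception would cancel all future runs, so report and continue.
            System.err.println("Periodic sync failed: " + e);
        }
    }

    /** Stops the timer, waits for any in-flight sync, then performs one final flush. */
    @Override
    public void close() {
        timer.shutdown();
        try {
            timer.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        syncAction.run();
    }
}
```

A crawler might create one instance at startup, for example new PeriodicSyncer(env::sync, 30, TimeUnit.SECONDS), and close it on shutdown before closing the store and environment. The 30-second period is an arbitrary illustration; a longer period means fewer log writes but more unsynced data at risk if the process crashes.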
