I’m looking for an open-source web crawler written in Java which, in addition to


Asked: June 16, 2026


I’m looking for an open-source web crawler written in Java that, in addition to the usual crawler features (depth limits, multi-threading, etc.), lets me customize how each file type is handled.

To be more precise, when a file is downloaded (or is about to be downloaded), I want to handle the save operation myself. HTML files should be saved in one repository, images in another location, and other files somewhere else. Also, a repository might not be a simple file system.

I’ve heard a lot about Apache Nutch. Can it do this? I’d like to achieve this as simply and quickly as possible.


1 Answer

Editorial Team · Answer 1 · 2026-06-16T07:34:17+00:00

Assuming you want a lot of control over how the crawler works, I would recommend crawler4j. It comes with many examples, so you can get a quick idea of how things work.

You can easily handle resources based on their content type (take a look at the Page.java class: it is the class of the object that holds information about a fetched resource).

There are no limitations regarding the repository. You can use anything you wish.
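
To make the idea of routing by content type more concrete, here is a minimal sketch in plain Java. It does not depend on crawler4j, and the names ResourceRepository, InMemoryRepository, and ContentTypeRouter are hypothetical, invented for illustration only. The assumption is that your crawler hands you a URL, a Content-Type string, and the raw bytes of each fetched resource.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical storage abstraction: a file system, a database, or an object store could implement it. */
interface ResourceRepository {
    void save(String url, byte[] content);
}

/** In-memory stand-in, used here only so the sketch runs without external storage. */
class InMemoryRepository implements ResourceRepository {
    final Map<String, byte[]> stored = new HashMap<>();

    @Override
    public void save(String url, byte[] content) {
        stored.put(url, content);
    }
}

/** Picks a repository based on the Content-Type of a fetched resource. */
class ContentTypeRouter {
    private final ResourceRepository htmlRepo;
    private final ResourceRepository imageRepo;
    private final ResourceRepository otherRepo;

    ContentTypeRouter(ResourceRepository htmlRepo,
                      ResourceRepository imageRepo,
                      ResourceRepository otherRepo) {
        this.htmlRepo = htmlRepo;
        this.imageRepo = imageRepo;
        this.otherRepo = otherRepo;
    }

    /** Returns the repository responsible for the given Content-Type header value. */
    ResourceRepository select(String contentType) {
        if (contentType == null) {
            return otherRepo; // unknown type: fall back to the catch-all repository
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (mime.equals("text/html") || mime.equals("application/xhtml+xml")) {
            return htmlRepo;
        }
        if (mime.startsWith("image/")) {
            return imageRepo;
        }
        return otherRepo;
    }

    /** Saves the resource into whichever repository matches its type. */
    void store(String url, String contentType, byte[] content) {
        select(contentType).save(url, content);
    }
}
```

With crawler4j, a call to store(...) would most likely go in the visit(Page page) method of your WebCrawler subclass, fed from page.getWebURL().getURL(), page.getContentType(), and page.getContentData(). Treat that wiring as an assumption and check it against the crawler4j examples for the version you use; you may also need to enable binary content in the crawl configuration for images to reach your crawler at all.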
