I am implementing a web crawler and I am using Crawler4j library. I am


Asked: June 6, 2026 (2026-06-06T21:56:35+00:00)


I am implementing a web crawler using the Crawler4j library, and I am not getting all the links on a website.
I tried to extract all the links on a single page with Crawler4j, and some links were missed.

Crawler4j version: crawler4j-3.3

URL I used: http://testsite2012.site90.com/frontPage.html

Number of links on this page: almost 60, of which 4-5 are repeated

Number of links Crawler4j returned: 23

This is the list of URLs on the page, and this is the list of URLs returned by Crawler4j.

I looked at the ‘HtmlContentHandler.java’ file that Crawler4j uses to extract links. It extracts only links associated with the ‘src’ and ‘href’ attributes.

I compared the two lists. Crawler4j is missing the links that are not associated with a ‘src’ or ‘href’ attribute and that appear inside ‘script’ tags.
This is the list of links that Crawler4j didn’t crawl.

How can I extract all the links on this page?
Do I need to do string manipulation (like searching for ‘http’) on the parsed HTML, or should I change the code in ‘HtmlContentHandler.java’?

Which is the best way?

Even if I extract all the links on this page with string manipulation, Crawler4j still crawls the website using only the links it extracted itself. Won’t it miss some pages in that case?


1 Answer

Editorial Team · Answer 1 · 2026-06-06T21:56:39+00:00


Try using regular expressions to locate the links.

You can look here for an example.
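
To make the regular-expression suggestion concrete, here is a minimal sketch in plain Java that scans raw HTML for anything that looks like an absolute http/https URL, including URLs that sit inside ‘script’ blocks rather than in ‘src’ or ‘href’ attributes. The class name `LinkExtractor`, the method `extractLinks`, and the pattern itself are assumptions made for illustration; they are not part of Crawler4j. The pattern is deliberately loose, so it can pick up false positives and will not find relative links.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Crawler4j class.
class LinkExtractor {

    // Assumed pattern: "http://" or "https://" followed by any run of
    // characters that are not whitespace, quotes, angle brackets or ")".
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[^\\s\"'<>)]+", Pattern.CASE_INSENSITIVE);

    // Returns each distinct URL found in the text, in order of first appearance.
    static List<String> extractLinks(String html) {
        Set<String> unique = new LinkedHashSet<>(); // drops repeated links
        Matcher matcher = URL_PATTERN.matcher(html);
        while (matcher.find()) {
            unique.add(matcher.group());
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Made-up sample page: one link in href, one only inside a script, one repeat in src.
        String html = "<a href=\"http://example.com/a.html\">a</a>"
                + "<script>var next = 'http://example.com/b.html';</script>"
                + "<img src=\"http://example.com/a.html\">";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Running `main` on the sample string prints the two distinct URLs, with the repeated one listed once. This only covers finding the links in the page text; how to hand the extra URLs back to Crawler4j so that it also crawls them depends on the library version and is not shown here.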
