I am using crawler4j to crawl a website. When I visit a page, I

Question


Asked: June 5, 2026 (2026-06-05T13:39:16+00:00)


I am using crawler4j to crawl a website. When I visit a page, I would like to get the link text of all the links, not only the full URLs. Is this possible?

Thanks in advance.


1 Answer

Editorial Team · Answer 1 · 2026-06-05T13:39:17+00:00

In your class that extends WebCrawler, get the contents of the page inside visit(Page page) and then apply a regular expression that captures each link's href and its link text.

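import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Inside visit(Page page): map each href (group 1) to its link text (group 2).
Map<String, String> urlLinkText = new HashMap<String, String>();
// getContentCharset() may be null if no charset was detected; fall back to UTF-8.
String charset = page.getContentCharset() != null ? page.getContentCharset() : "UTF-8";
String content = new String(page.getContentData(), Charset.forName(charset));
Pattern pattern = Pattern.compile("<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
    urlLinkText.put(matcher.group(1), matcher.group(2));
}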

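Then stick urlLinkText somewhere that you can get to it once your crawl is complete. For example, you could make it a private member of your crawler class and add a getter. Note that the pattern only matches double-quoted href values and link text that contains no nested tags, so a link like <a href="/x"><b>text</b></a> is skipped.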
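To try the regular expression without running a crawl, the extraction step can be pulled into a small standalone helper. This is only a sketch: the class name LinkTextExtractor and the method extract are made up for illustration and are not part of crawler4j, and the trim() call and the LinkedHashMap (to keep links in page order) are additions of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of crawler4j.
class LinkTextExtractor {
    // Same pattern as in the answer: double-quoted href, link text without nested tags.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns href -> link text, in the order the links appear in the HTML.
    // If the same href occurs more than once, the last link text wins.
    static Map<String, String> extract(String html) {
        Map<String, String> urlLinkText = new LinkedHashMap<>();
        Matcher matcher = ANCHOR.matcher(html);
        while (matcher.find()) {
            // trim() is an addition here, to drop whitespace around the link text.
            urlLinkText.put(matcher.group(1), matcher.group(2).trim());
        }
        return urlLinkText;
    }
}
```

Inside the crawler you would pass the decoded page content to LinkTextExtractor.extract(content) and merge the result into whatever member you expose through the getter.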
