I am looking for a way to find all the web pages and sub

Question

Asked: June 2, 2026 (2026-06-02T13:37:14+00:00)

I am looking for a way to find all the web pages and subdomains in a domain. For example, for the uoregon.edu domain, I would like to find all the web pages in this domain and in all of its subdomains (e.g., cs.uoregon.edu).

I have been looking at Nutch, and I think it can do the job. However, Nutch seems to download entire web pages and index them for later search, whereas I want a crawler that only scans each page for URLs that belong to the same domain. Furthermore, Nutch seems to save the linkdb in a serialized format. How can I read it? I tried Solr, and it can read Nutch's collected data, but I don't think I need Solr, since I am not performing any searches. All I need are the URLs that belong to a given domain.

Thanks

1 Answer

Editorial Team · Answer 1 · 2026-06-02T13:37:16+00:00

If you're familiar with Ruby, consider using Anemone, a wonderful crawling framework. Here is sample code that collects the URL of every page it visits, once site_url is set to your start URL.

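require 'anemone'

# site_url must be set to the start URL of the crawl before running this
urls = []

Anemone.crawl(site_url) do |anemone|
  # Record the URL of every page the crawler visits
  anemone.on_every_page do |page|
    urls << page.url
  end
end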

https://github.com/chriskite/anemone

Disclaimer: To crawl subdomains, you need to apply a patch from the project's issues. You might also want to consider adding a maximum page count so the crawl does not run indefinitely.
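
To illustrate the two points in the disclaimer, here is a minimal sketch in plain Ruby (standard library only) of a subdomain-aware domain check and a page cap. The helper names in_domain? and collect_urls and the max_pages parameter are hypothetical and are not part of Anemone. This only filters URLs that have already been discovered; it does not by itself make a crawler follow links into subdomains.

```ruby
require 'uri'
require 'set'

# Hypothetical helper: true if the URL's host is `domain` itself
# or any subdomain of it (e.g. cs.uoregon.edu for uoregon.edu).
def in_domain?(url, domain)
  host = URI.parse(url.to_s).host
  return false if host.nil?

  host = host.downcase
  domain = domain.downcase
  # The leading dot prevents "notuoregon.edu" from matching "uoregon.edu".
  host == domain || host.end_with?(".#{domain}")
rescue URI::InvalidURIError
  false
end

# Hypothetical helper: keep unique in-domain URLs, in order,
# and stop once max_pages URLs have been collected.
def collect_urls(candidates, domain, max_pages: 1000)
  seen = Set.new
  candidates.each do |url|
    break if seen.size >= max_pages

    seen << url.to_s if in_domain?(url, domain)
  end
  seen.to_a
end
```

A check like in_domain?(page.url, 'uoregon.edu') could be used inside the on_every_page block to decide whether to record a URL, assuming page.url converts cleanly to a string with to_s.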
