I am looking for a way to find all the web pages and subdomains in a domain. For example, for uoregon.edu, I would like to find every web page in that domain and in all of its subdomains (e.g., cs.uoregon.edu).
I have been looking at Nutch, and I think it can do the job. However, Nutch seems to download entire web pages and index them for later search, whereas I want a crawler that only scans each page for URLs that belong to the same domain. Furthermore, Nutch seems to save its linkdb in a serialized format. How can I read it? I tried Solr, and it can read the data Nutch collects, but I don't think I need Solr, since I am not performing any searches. All I need are the URLs that belong to a given domain.
Thanks
If you’re familiar with Ruby, consider using Anemone. It is a wonderful crawling framework, and the project page linked below has sample code that works out of the box.
https://github.com/chriskite/anemone
Disclaimer: To crawl subdomains you need to apply a patch from the project's issues, and you might want to add a maximum page count so the crawl stops at some point.
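If you would rather see the two caveats above spelled out, here is a rough sketch in plain Ruby (standard library only, not Anemone's API) of a crawl that follows only links on the domain or its subdomains and stops after a page limit. The names `in_domain?`, `extract_links`, `crawl`, and `max_pages` are made up for illustration, the regex-based link extraction is a simplification that a real HTML parser would do better, and the page fetcher is passed in as a block so you can plug in whatever HTTP client you like.

```ruby
require "uri"
require "set"

# Hypothetical helper: true when +url+ is on +domain+ or on any subdomain of it.
def in_domain?(url, domain)
  host = URI.parse(url).host
  return false if host.nil?
  host = host.downcase
  host == domain || host.end_with?(".#{domain}")
rescue URI::InvalidURIError
  false
end

# Hypothetical helper: collect href values from raw HTML (fragments dropped)
# and resolve them against the page they were found on.
# Assumption: a regex is good enough for a sketch; it will miss unusual markup.
def extract_links(html, base_url)
  hrefs = html.scan(/href\s*=\s*["']([^"'#]+)/i).flatten
  links = hrefs.map do |href|
    begin
      URI.join(base_url, href.strip).to_s
    rescue URI::Error
      nil # skip links that cannot be parsed or resolved
    end
  end
  links.compact
end

# Breadth-first crawl that stays on +domain+ and stops after +max_pages+ pages.
# The block receives a URL and returns the page HTML, or nil to skip the page.
def crawl(start_url, domain, max_pages: 100, &fetch)
  seen = Set[start_url]
  queue = [start_url]
  visited = []
  until queue.empty? || visited.size >= max_pages
    url = queue.shift
    html = fetch.call(url)
    next if html.nil?
    visited << url
    extract_links(html, url).each do |link|
      next unless link.start_with?("http") && in_domain?(link, domain)
      queue << link if seen.add?(link) # add? returns nil for already-seen URLs
    end
  end
  visited
end
```

For a real crawl the block could be something like `{ |u| Net::HTTP.get(URI(u)) }` after `require "net/http"`, ideally with error handling, redirects, and a polite delay added. The return value is just the list of visited URLs, which is all the question asks for.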