I'm trying to build a “crawler” or “automatic downloader” for every file hosted on a web server / web page.
In my opinion there are two cases:
1) Directory listing is enabled. Then it's easy: read the entries in the listing and download every file you see.
2) Directory listing is disabled.
What then?
The only idea I have is to brute-force filenames and watch how the server reacts (e.g. 404 for no file, 403 for a found directory, and data for a correctly guessed file).
Is my idea right? Is there a better way?
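A minimal sketch of that status-code probing in Python, using only the standard library. The names `classify` and `probe` are made up for illustration, and the mapping is only a guess: servers are free to answer 403 or 404 for other reasons, so treat the result as a hint rather than proof.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def classify(status):
    """Map an HTTP status code to the guess described in the question."""
    if status == 404:
        return "missing"
    if status == 403:
        return "forbidden (possibly a directory)"
    if 200 <= status < 300:
        return "found"
    return "other"


def probe(url, timeout=10):
    """Send a HEAD request and classify the server's answer."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return classify(response.status)
    except HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the code is still usable.
        return classify(err.code)
    except URLError:
        # DNS failure, refused connection, timeout, and so on.
        return "unreachable"
```

Something like `probe("http://example.com/backup.zip")` would then report one of the labels above; the URL here is just a placeholder.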
You can always parse the HTML and follow (‘crawl’) the links you find. This is the way most crawlers are implemented.
These libraries can help you do it:
.NET: Html Agility Pack
Python: Beautiful Soup
PHP: Simple HTML DOM Parser
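To illustrate the link-following idea, here is a rough sketch that pulls the links out of a page. It uses Python's built-in `html.parser` instead of Beautiful Soup only so that it runs without installing anything; `LinkCollector` and `extract_links` are hypothetical names, and restricting the result to the same host is an assumption about what you want to download.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the links in `html` that stay on the same host as `base_url`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return [link for link in collector.links if urlparse(link).netloc == host]
```

A crawler would fetch a page, call `extract_links` on its body, and push any URL it has not seen yet onto a queue. The same function also works on a directory listing page, since that is just HTML with one link per file.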
ALWAYS look for robots.txt in the site’s root and make sure you respect the site’s rules on which pages are allowed to be crawled.
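In Python the standard library can do the robots.txt check for you via `urllib.robotparser`. The sketch below is only an outline: `build_rules` and `load_rules` are hypothetical helper names, and "my-crawler" is a placeholder user agent.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def load_rules(site_root):
    """Download and parse robots.txt from the root of the site."""
    parser = RobotFileParser()
    parser.set_url(urljoin(site_root, "/robots.txt"))
    parser.read()
    return parser
```

Before fetching a URL, ask the parser first, e.g. `rules.can_fetch("my-crawler", url)`, and skip the URL when it returns `False`.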