I need to migrate a website to a new CMS. We do not have access to the original site except via http://mysite.com.
We currently have a variety of scripts that i) index the site, ii) create some hierarchy, and iii) scrape the unique content (i.e. ignore the header, footer, template, etc.).
The scripts work quite well, except for indexing the site. Is there a good utility that can index all the unique URLs of a site?
Currently we use a mixture of
$oHTML = new simple_html_dom();
$oHTML->setBody(file_get_contents('http://mysite.com'));
foreach($oHTML->find('a') as $oLink) {}
and a recursive function that visits all the unique links.
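To make the crawl loop concrete, here is a minimal sketch of that kind of link indexer. It is not our actual script: it swaps the recursion for a queue plus a "seen" set and uses PHP's built-in DOMDocument instead of simple_html_dom. The names crawlSite, normalizeUrl, $fetch and $maxPages are hypothetical, and the URL resolution is deliberately simplified (it does not handle "../" segments or query-string-only links).

```php
<?php
// Hypothetical helper: turn an href into an absolute URL, or null if it should be skipped.
// Simplified on purpose: no "../" resolution, no <base> tag support.
function normalizeUrl(string $href, string $base): ?string
{
    $href = trim(explode('#', $href, 2)[0]);            // drop any #fragment
    if ($href === '' || preg_match('#^(mailto|javascript|tel):#i', $href)) {
        return null;
    }
    if (preg_match('#^https?://#i', $href)) {
        return $href;                                    // already absolute
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href;           // protocol-relative link
    }
    if ($href[0] === '/') {
        return $root . $href;                            // root-relative link
    }
    // Relative link: resolve against the directory of the current page.
    $dir = preg_replace('#/[^/]*$#', '/', $parts['path'] ?? '/');
    return $root . $dir . $href;
}

// Hypothetical crawler: breadth-first, same host only, one page in memory at a time.
// $fetch is any callable that takes a URL and returns the HTML string (or false).
function crawlSite(string $startUrl, callable $fetch, int $maxPages = 10000): array
{
    libxml_use_internal_errors(true);                    // real-world HTML is rarely valid
    $host  = parse_url($startUrl, PHP_URL_HOST);
    $seen  = [$startUrl => true];                        // set of URLs already queued
    $queue = new SplQueue();
    $queue->enqueue($startUrl);

    while (!$queue->isEmpty()) {
        $url  = $queue->dequeue();
        $html = $fetch($url);
        if (!is_string($html) || $html === '') {
            continue;                                    // fetch failed or empty page
        }
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        libxml_clear_errors();                           // do not let parse errors pile up
        foreach ($doc->getElementsByTagName('a') as $a) {
            $link = normalizeUrl($a->getAttribute('href'), $url);
            if ($link === null || isset($seen[$link]) || count($seen) >= $maxPages) {
                continue;
            }
            if (parse_url($link, PHP_URL_HOST) !== $host) {
                continue;                                // stay on the same site
            }
            $seen[$link] = true;
            $queue->enqueue($link);
        }
        unset($doc);                                     // release the DOM before the next page
    }
    return array_keys($seen);
}
```

Against the live site this would be called as crawlSite('http://mysite.com', 'file_get_contents'). Passing the fetcher in as a callable is only a convenience so the loop can be tried against canned HTML first.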
The question is: PHP is slow and hits its memory limit quickly, so is this the right approach? Could I use Sphinx or another open-source search engine to do the indexing for me?
Or, after step ii (creating the hierarchy), should I just run the Sphinx indexer to re-index?