This is probably an easy question but I can’t find the answer… I have a PHP script named ‘send.php’ which makes a cURL request to open an external web page. It outputs the external page to the browser. All completely by the books.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_exec($ch);
All it does is post some data to a processing script on an external site and then display in the browser whatever that external script would normally display, i.e. a confirmation message, a thank-you page, etc.
Problem is: my ‘send.php’ is still the URL that appears in the address bar. So if I click around on that page and the links use relative paths, the browser resolves those paths against my current URL, which of course leads to a 404. Additionally, if there are more form fields on the page and the form's action is an empty string, the browser will post those submissions to send.php on my server again, which then generates errors.
How can I make it so it will still send the post data and output the result of the processing script but still allow the user to navigate the output page as they normally would? Or if it’s a multi-page form, they can continue filling out page 2 as if they were just on that site?
Thanks in advance
Update: Solved by adding these lines to the above code (the assignment below replaces the bare curl_exec($ch) call, so the request is only made once):
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
$url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
$response = str_ireplace('<head>', "<head><base href=\"$url\" />", $response);
echo $response;
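The str_ireplace call above only matches a literal <head> tag and puts the URL into the attribute unescaped. A slightly more defensive variant might look like the sketch below. The function name inject_base_tag is made up for illustration, and the fallback of prepending the tag when no <head> is found is an assumption about what is acceptable for the page.

```php
<?php
// Hypothetical helper (not part of the original post): insert a <base> tag
// right after the opening <head ...> tag of an HTML string.
function inject_base_tag(string $html, string $baseUrl): string
{
    // Escape the URL so quotes or ampersands cannot break the attribute.
    $tag = '<base href="' . htmlspecialchars($baseUrl, ENT_QUOTES, 'UTF-8') . '" />';

    // Match <head> with or without attributes, case-insensitively, and only
    // the first occurrence. The \s requirement avoids matching <header>.
    $count = 0;
    $result = preg_replace_callback(
        '/<head(\s[^>]*)?>/i',
        function (array $m) use ($tag) {
            return $m[0] . $tag;
        },
        $html,
        1,
        $count
    );

    // Assumption: if there is no <head> tag at all, prepend the tag instead.
    return $count > 0 ? $result : $tag . $html;
}
```

With the code from the update, this would be used as echo inject_base_tag($response, $url); in place of the str_ireplace and echo lines.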
You can get the URL that curl resolves to (if you’re using FOLLOWLOCATION) with curl_getinfo and CURLINFO_EFFECTIVE_URL. You can prepend this URL to all relative paths. As for how to tell whether a path is relative .. well .. if it starts with a ‘/’ it’s absolute, which actually makes it “relative” to the domain. If it starts with a scheme, it’s also absolute, and it may even lead to a different domain.
As to how to actually find the URLs .. you could use DOMDocument::loadHTML and use DOMXPath to find all anchor tags (or links, if you like). Think about how much money Google engineers get paid for site scraping and URL following. This is probably not the simplest thing in the world to do optimally.
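A rough sketch of that approach follows, under stated assumptions. The function names absolutize and rewrite_links are hypothetical, $baseUrl is assumed to be the value returned by CURLINFO_EFFECTIVE_URL (an http or https URL with a host), and the list of tags is just a starting point. It does not normalize ../ segments, and a form with no action attribute at all is left untouched.

```php
<?php
// Hypothetical helper: turn one href/src/action value into an absolute URL.
function absolutize(string $link, string $baseUrl): string
{
    // Already has a scheme (http:, https:, mailto:, ...): leave it alone.
    if (preg_match('/^[a-z][a-z0-9+.-]*:/i', $link)) {
        return $link;
    }
    $base = parse_url($baseUrl);
    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');
    $path = $base['path'] ?? '/';

    if ($link === '') {
        return $baseUrl;                      // e.g. <form action="">
    }
    if ($link[0] === '#') {
        return $link;                         // same-page fragment
    }
    if (strncmp($link, '//', 2) === 0) {
        return $base['scheme'] . ':' . $link; // protocol-relative
    }
    if ($link[0] === '/') {
        return $origin . $link;               // relative to the domain
    }
    if ($link[0] === '?') {
        return $origin . $path . $link;       // query on the current page
    }
    // Relative to the directory of the current page.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $link;
}

// Hypothetical helper: rewrite link-like attributes in an HTML string.
function rewrite_links(string $html, string $baseUrl): string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = '//a[@href] | //link[@href] | //img[@src] | //script[@src] | //form[@action]';
    foreach ($xpath->query($query) as $node) {
        foreach (['href', 'src', 'action'] as $attr) {
            if ($node->hasAttribute($attr)) {
                $value = $node->getAttribute($attr);
                $node->setAttribute($attr, absolutize($value, $baseUrl));
            }
        }
    }
    return $doc->saveHTML();
}
```

Usage would be along the lines of echo rewrite_links($response, $url);. Note that DOMDocument may re-serialize the markup differently from the original, and loadHTML guesses the character encoding unless the page declares it, so the simpler base-tag approach is likely to be more faithful when it is enough.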