I had an interesting task today and couldn’t find much on the subject. I wanted to share this and ask for suggestions on how it could have been done more elegantly. I consider myself a mediocre programmer who really wants to improve, so any feedback is highly appreciated. There is also a strange bug I can’t figure out. So here goes, and hopefully this helps anyone who has to do something similar.
A client was redoing a site, moving content around, and had a couple thousand redirects that needed to be made. Marketing sent me an XLS with old URLs in one column, new URLs in the next. These were the actions I took:
- Saved the XLS as CSV
- Wrote a script which:
  - Formatted the list as valid 301 redirects
  - Exported the list to a text file
I then copied and pasted all the new directives into my .htaccess file.
Then I wrote another script that checked each of the new links to make sure it was valid (no 404s). The first script worked exactly as expected. The second script does print all the 404s (there were several), but it doesn’t exit when it finishes the loop and it doesn’t write to the file; it just hangs on the command line. No errors get reported. Any idea what’s going on? Here is the code for both scripts:
Formatting 301s:
<?php
$source = "301.csv";
$output = "301.txt";
//grab the contents of the source file as an array, prepare the output file for writing
$sourceArray = file($source);
$handleOutput = fopen($output, "w");
//Set the strings we want to replace. The first array holds the strings to search for and the second holds their replacements
$originalLines = array(
'http://hipaasecurityassessment.com',
','
);
$replacementStrings = array(
'',
' '
);
//Split each item from the array into two strings, one which occurs before the comma and the other which occurs after
function setContent($sourceArray, $originalLines = array(), $replacementStrings = array()){
    $outputArray = array();
    $text = 'redirect 301 ';
    foreach ($sourceArray as $number => $item){
        $pattern = '/[,]/';
        $item = preg_split($pattern, $item);
        $item = array(
            $item[0],
            preg_replace('#"#', '', $item[1])
        );
        $item = implode(' ', $item);
        $item = str_replace($originalLines, $replacementStrings, $item);
        array_push($outputArray,$text,$item);
    }
    $outputString = implode('', $outputArray);
    return $outputString;
}
//Invoke the set content function
$outputString = setContent($sourceArray, $originalLines, $replacementStrings);
//Finally, write to the text file and close it!
fwrite($handleOutput, $outputString);
fclose($handleOutput);
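As an aside on the formatting step: PHP's fgetcsv() parses quoted CSV fields itself, so the split-and-strip-quotes work could be sketched roughly as below. formatRedirects() and its $stripPrefix parameter are hypothetical names, and the sketch assumes, as the script above appears to, that the site prefix is stripped from both columns.

```php
<?php
// Sketch: build "redirect 301 <old> <new>" lines from a two-column CSV stream.
// formatRedirects() and $stripPrefix are hypothetical names, not part of the original script.
function formatRedirects($handle, string $stripPrefix): string
{
    $lines = [];
    // fgetcsv() splits on commas and unwraps quoted fields, one row per call
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        if (count($row) < 2) {
            continue; // skip blank or incomplete rows
        }
        $old = str_replace($stripPrefix, '', trim($row[0]));
        $new = str_replace($stripPrefix, '', trim($row[1]));
        $lines[] = "redirect 301 $old $new";
    }
    return $lines ? implode(PHP_EOL, $lines) . PHP_EOL : '';
}
```

It could be called as formatRedirects(fopen('301.csv', 'r'), 'http://hipaasecurityassessment.com'), with the returned string passed to file_put_contents('301.txt', ...).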
Checking for 404s:
<?php
$source = "301.txt";
$output = "print404.txt";
//grab the contents of the source file as an array, prepare the output file for writing
$sourceArray = file($source);
$handleOutput = fopen($output, "w");
//Split each line on spaces and keep the third piece, which is the new URL
function getUrls($sourceArray = array()){
    $outputArray = array();
    foreach ($sourceArray as $number => $item){
        $item = str_replace('redirect 301', '', $item);
        $pattern = '#[ ]+#';
        $item = preg_split($pattern, $item);
        $item = array(
            $item[0],
            $item[1],
            $item[2]
        );
        array_push($outputArray, $item[2]);
    }
    return $outputArray;
}
//Check a single URL for a 404 error via a curl request
function check404($url, $handleOutput){
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
    $content = curl_exec( $handle );
    $response = curl_getinfo( $handle );
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    if($httpCode == 404) {
        //fwrite($handleOutput, $url);
        print $url;
    }
}
$outputArray = getUrls($sourceArray);
foreach ($outputArray as $url)
{
    $errors = check404($url, $handleOutput);
}
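Some guesses at the symptoms, not a confirmed diagnosis: file() keeps the trailing newline on every line, so each URL handed to curl ends in a line break; no timeout is set, so one unresponsive request can stall the whole loop; and in the code as posted the fwrite() call is commented out, so nothing can reach print404.txt. The sketch below addresses those three points. targetUrls(), statusOf() and report404s() are hypothetical names, the timeout values are arbitrary, and the redirect targets are assumed to be absolute URLs.

```php
<?php
// Pull the redirect target (last field) out of each "redirect 301 <old> <new>" line.
function targetUrls(array $lines): array
{
    $urls = [];
    foreach ($lines as $line) {
        // trim() drops the newline that file() leaves on each line
        $parts = preg_split('/\s+/', trim($line));
        if (count($parts) >= 4) {
            $urls[] = $parts[3];
        }
    }
    return $urls;
}

// Return the HTTP status code for a URL, or 0 if the request failed or timed out.
function statusOf(string $url): int
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_NOBODY, true);       // HEAD-style request, the body is not needed
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_CONNECTTIMEOUT, 5);  // seconds, arbitrary
    curl_setopt($handle, CURLOPT_TIMEOUT, 10);        // seconds, arbitrary
    curl_exec($handle);
    return (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);
}

// Write every URL that answers 404 to the output file and return the list of them.
// $statusOf lets a different checker be swapped in; it defaults to statusOf() above.
function report404s(array $urls, string $outputFile, ?callable $statusOf = null): array
{
    $statusOf = $statusOf ?? 'statusOf';
    $missing = [];
    foreach ($urls as $url) {
        if ($statusOf($url) === 404) {
            $missing[] = $url;
        }
    }
    file_put_contents($outputFile, $missing ? implode(PHP_EOL, $missing) . PHP_EOL : '');
    return $missing;
}
```

With these in place, the main part of the script would reduce to report404s(targetUrls(file('301.txt')), 'print404.txt'). Some servers answer HEAD requests differently from GET, so dropping CURLOPT_NOBODY may be worth trying if the results look off.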
You should have used fgetcsv() for generating the original URL list. It splits each CSV line into an array, which simplifies the transformation.
Can’t say anything about the 404s or the cause of the hang. But using the wacky curl functions is almost always a bad sign. For testing purposes I would have used a command-line tool like wget instead, so the results can be double-checked manually.
But maybe you could try PHP’s own get_headers() instead. It returns the raw response headers. Note that, as far as I know, it follows redirects by default, so you would have to switch that off through a stream context to see only the first response.
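A minimal sketch of the get_headers() route, under some assumptions: the URLs are absolute, and a stream context with follow_location set to 0 is passed so that the first response is reported instead of the end of a redirect chain. statusCode() and headStatus() are hypothetical helper names, and the timeout is arbitrary.

```php
<?php
// Extract the numeric status from a status line such as "HTTP/1.1 404 Not Found".
function statusCode(string $statusLine): int
{
    return preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m) ? (int) $m[1] : 0;
}

// Fetch only the headers of $url without following redirects; returns 0 on failure.
function headStatus(string $url): int
{
    $context = stream_context_create([
        'http' => [
            'method' => 'HEAD',
            'follow_location' => 0, // report the first response instead of following redirects
            'timeout' => 10,        // seconds, arbitrary
        ],
    ]);
    // trim() guards against the newline that file() leaves on each line
    $headers = @get_headers(trim($url), false, $context);
    return $headers ? statusCode($headers[0]) : 0;
}
```

A URL would then count as broken when headStatus($url) returns 404; a return value of 0 means the request itself failed and probably deserves a manual look.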