I want to use a regex that recognizes spaces in a .pdf file name.
So far I have been able to match the link to the file, but the match stops at the first space in the file name.
<?php
echo "<h1>Reading content from ITM website!</h1>";

// Download the page into a local file.
$ch = curl_init("http://domain.edu/index.php?option=com_content&view=article&id=58&Itemid=375&alias=lms");
$fp = fopen("example_homepage.txt", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp); // close (and flush) the file before reading it back

// Keep only the part of the page starting at "More quick links".
$contents = strstr(file_get_contents("example_homepage.txt"), 'More quick links');

// Turn the site-relative links into absolute ones.
$new_content = str_replace('<a href="', '<a href="http://www.domain.edu', $contents);

// Wrap anything that looks like a URL in an anchor tag.
$regex = '@((https?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.\,]*(\?\S+)?)?)*)@';
$text = preg_replace($regex, '<a href="$1">$1</a>', $new_content);
//echo $new_content;
echo $text;
?>
Current Output:
http://www.domain.edu/academiccalendar/Notice for final practical.pdf" target="_blank">Title
Here, “Notice for final practical.pdf” is not treated as part of the URL and just appears as plain text.
Really, you should not use regex for screen scraping. It’s slow and eventually it will break. Instead, use a DOM parser, or simply PHP's DOMDocument.
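As a rough sketch of the DOMDocument approach (not a drop-in replacement for the code in the question): the function name extractLinks, the $base parameter, and the sample HTML string are made up for illustration, and the sketch assumes every non-absolute href is site-relative. Because the href attribute is read as a whole, spaces in the file name are no longer a problem; they are encoded as %20 before the link is printed.

```php
<?php
// Hypothetical helper: collects every <a href> in $html as an absolute URL.
// $base is an assumed site root, e.g. "http://www.domain.edu".
function extractLinks(string $html, string $base): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '') {
            continue;
        }
        // Assumption: anything without a scheme is relative to the site root.
        if (!preg_match('@^https?://@i', $href)) {
            $href = rtrim($base, '/') . '/' . ltrim($href, '/');
        }
        // Encode spaces so the whole file name stays inside the URL.
        $links[] = [
            'url'  => str_replace(' ', '%20', $href),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

// Sample markup modelled on the output shown in the question.
$html = '<p>More quick links</p>'
      . '<a href="/academiccalendar/Notice for final practical.pdf" target="_blank">Title</a>';

foreach (extractLinks($html, 'http://www.domain.edu') as $link) {
    printf(
        "<a href=\"%s\">%s</a>\n",
        htmlspecialchars($link['url']),
        htmlspecialchars($link['text'])
    );
}
```

In the original script you would pass $contents (the part of the page after "More quick links") instead of the sample string.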