I want to use regex to recognize space in the .pdf file name

Editorial Team

Asked: June 13, 2026 (2026-06-13T17:00:57+00:00)

I want to use a regex to recognize spaces in a .pdf file name.

So far I have been able to recognize the link to the file, but the regex does not recognize the spaces in the file name.

   <?php
   echo "<h1>Reading content from ITM website!</h1>";
   $ch = curl_init("http://domain.edu/index.php?option=com_content&view=article&id=58&Itemid=375&alias=lms");
   $fp = fopen("example_homepage.txt", "w");

   curl_setopt($ch, CURLOPT_FILE, $fp);
   curl_setopt($ch, CURLOPT_HEADER, 0);

   curl_exec($ch);
   curl_close($ch);
   // Close the output file before reading it back so the download is fully written
   fclose($fp);

   // Keep only the part of the page that starts at 'More quick links'
   $contents = strstr(file_get_contents('example_homepage.txt'), 'More quick links');
   // Turn the relative links into absolute ones
   $new_content = str_replace('<a href="', '<a href="http://www.domain.edu', $contents);
   $regex = '@((https?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.\,]*(\?\S+)?)?)*)@';
   $text = preg_replace($regex, '<a href="$1">$1</a>', $new_content);
   //echo $new_content;
   echo $text;
   ?>

Current Output:

http://www.domain.edu/academiccalendar/Notice for final practical.pdf" target="_blank">Title

Here, “Notice for final practical.pdf” is not recognized as part of the URL and just appears as plain text.
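The behaviour can be reproduced in isolation. The following is a minimal sketch, assuming the same pattern as above applied to the bare URL string rather than to the fetched page. The path part of the pattern only allows the characters in `[-\w/_\.\,]`, which does not include a space, so the match ends at the first space.

```php
<?php
// Same pattern as in the question.
$regex = '@((https?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.\,]*(\?\S+)?)?)*)@';

// Assumed input: the URL from the output above, without the surrounding HTML.
$url = 'http://www.domain.edu/academiccalendar/Notice for final practical.pdf';

// preg_match returns the first match only.
preg_match($regex, $url, $m);

echo $m[0], "\n"; // http://www.domain.edu/academiccalendar/Notice
```

Everything after "Notice" falls outside the first match, which is why it is not wrapped in the generated link.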

1 Answer

Editorial Team · Answer 1 · 2026-06-13T17:00:58+00:00

Really, you should not use regex for screen scraping: it is slow, and it will eventually break when the page markup changes. Instead, use a DOM parser such as PHP's built-in DOMDocument:

<?php 
//curl bit
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "http://itmindia.edu/index.php?option=com_content&view=article&id=58&Itemid=375&alias=lms");
curl_setopt($curl, CURLOPT_HEADER, 0);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_TIMEOUT, 30);
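$site = curl_exec($curl);
curl_close($curl);

//Stop early if the request failed
if($site === false){
    die('Could not fetch the page');
}

$dom = new DOMDocument();
//Real-world HTML is rarely valid: collect libxml warnings instead of hiding them with @
libxml_use_internal_errors(true);
$dom->loadHTML($site);
libxml_clear_errors();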

$ret=array();
foreach($dom->getElementsByTagName('a') as $links) {
    //Is pdf (case-insensitive check on the ".pdf" extension)
    if(strtolower(substr($links->getAttribute('href'),-4)) == '.pdf'){
        //Assign
        $url   = $links->getAttribute('href');
        $title = trim($links->nodeValue);
        $ret[]=array('url'=>'http://itmindia.edu'.$url,
                     'title'=>(empty($title)?basename($url):$title));
    }
}

print_r($ret);
/* Result
Array
(
    [0] => Array
        (
            [url] => http://itmindia.edu/images/ITM/pdf/ITMU bro june.pdf
            [title] => ITMU Brochure
        )

    [1] => Array
        (
            [url] => http://itmindia.edu/images/ITM/pdf/Report_2012_LR.pdf
            [title] => Annual Report to UGC July 2012
        )

    [2] => Array
        (
            [url] => http://itmindia.edu/admission2012/PhDwinter/Ph. D. application form 2012-13 for dec 2012 admission.pdf
            [title] => Application Form
        )

    [3] => Array
        (
            [url] => http://itmindia.edu/admission2012/PhDwinter/UF_Application_Form.pdf
            [title] => University Fellowship Form
        )
        ...
        ...
*/

//Then to output
foreach($ret as $v){
    //Escape the values before writing them into HTML
    echo '<a href="'.htmlspecialchars($v['url']).'" target="_blank">'.htmlspecialchars($v['title']).'</a>';
}
?>
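Since the original problem was spaces in the file names, it may also help to percent-encode the path before building the final URL. The following is a minimal sketch: `encode_path` is a hypothetical helper (not part of the code above), and it assumes the `href` is a plain relative path with no query string and no existing percent-encoding.

```php
<?php
// Hypothetical helper: percent-encode each path segment separately,
// so spaces become %20 while the slashes between segments are kept.
function encode_path(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

// Example href taken from the result listing above.
$href = '/images/ITM/pdf/ITMU bro june.pdf';
$url  = 'http://itmindia.edu' . encode_path($href);

echo $url, "\n"; // http://itmindia.edu/images/ITM/pdf/ITMU%20bro%20june.pdf
```

Encoding segment by segment is a deliberate choice here: calling `rawurlencode` on the whole path would also encode the slashes. If the page already contains percent-encoded links, this helper would encode them a second time, so that case would need to be handled separately.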
