Below is a link crawler that gets the urls of a page in a

Question


Asked: May 23, 2026

Below is a link crawler that collects the URLs on a page down to a given depth. At the end of it I added a regular expression to match all the email addresses on the page that was just crawled. As you can see in the second part, it calls file_get_contents on the same page that was just downloaded, which doubles the execution time, bandwidth, and so on.

The question is: how can I merge those two parts so that the second one reuses the page already downloaded, instead of fetching it again? Thank you.

function crawler($url, $depth = 2) {

    $dom = new DOMDocument('1.0');
    if (!$parts || !@$dom->loadHTMLFile($url)) {
        return;
    }
.
.
.

//this is where the second part starts

  $text = file_get_contents($url);
  $res = preg_match_all("/[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}/i", $text, $matches);

}

1 Answer

Editorial Team · Answer 1 · 2026-05-23T09:00:10+00:00

Replace:

$text = file_get_contents($url);

with:

$text = $dom->saveHTML();

http://www.php.net/manual/en/domdocument.savehtml.php

Alternatively, in the first part of your function, you could download the HTML into a variable with file_get_contents and pass that variable to $dom->loadHTML instead of calling $dom->loadHTMLFile. You can then reuse the same variable for your regex.

http://www.php.net/manual/en/domdocument.loadhtml.php
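
A minimal sketch of that second approach, assuming the page is fetched once and the resulting string is handed to both the DOM parser and the regex. The helper name extractLinksAndEmails and its return shape are hypothetical, not part of the original crawler; the elided crawling logic from the question is not reproduced here.

```php
<?php
// Hypothetical helper (not from the original post): takes HTML that has
// already been downloaded and returns both its links and its email addresses.
function extractLinksAndEmails(string $html): array
{
    $dom = new DOMDocument('1.0');
    // Guard against an empty string, then parse; @ silences warnings about
    // malformed markup, as the question's code does.
    if ($html === '' || !@$dom->loadHTML($html)) {
        return ['links' => [], 'emails' => []];
    }

    // Collect the href of every <a> element that has one.
    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }

    // Same pattern as in the question, applied to the string we already
    // hold rather than to a second download.
    preg_match_all(
        '/[a-z0-9]+([_\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\.[a-z]{2,}/i',
        $html,
        $matches
    );

    return [
        'links'  => $links,
        'emails' => array_values(array_unique($matches[0])),
    ];
}

// Usage inside the crawler: a single download feeds both steps.
// $html = @file_get_contents($url);
// if ($html !== false) {
//     $result = extractLinksAndEmails($html);
// }
```

Running the regex on the raw downloaded string, rather than on the output of saveHTML, means the pattern sees the page exactly as the server sent it; the serialized DOM may differ slightly from the original markup.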
