I am writing a crawler but I want it to crawl only links with the same base

Question

Asked: May 23, 2026 (2026-05-23T09:46:42+00:00)

I am writing a crawler, and I want it to crawl only the links that have the same base as the start URL. At the moment, if you set a large depth and your page contains a link to your Twitter profile, it will crawl Twitter too and return results like twitter.com/xxxyyyzzz.

What I want is to restrict the crawler to URLs that have the same base. I don’t mind if I have to set the domain again in a new variable.

Filtering the results and showing only the matching links at the end is not an appropriate solution. Imagine the crawler finds 1000 links when you only want 10 of them.
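
One way to apply the restriction during the crawl rather than afterwards is to test each link against the start URL before it is fetched. The sketch below assumes that "same base" means the same scheme and host; `has_same_base` is a hypothetical helper name, not part of any library, and example.com stands in for the real site.

```php
<?php
// Hypothetical helper: true when $candidate has the same scheme and host as $base.
// Assumption: "same base" means scheme + host; adjust if subdomains should count.
function has_same_base(string $candidate, string $base): bool
{
    $c = parse_url($candidate);
    $b = parse_url($base);
    if (!is_array($c) || !is_array($b)) {
        return false; // parse_url() could not parse one of the URLs
    }
    if (!isset($c['scheme'], $c['host'], $b['scheme'], $b['host'])) {
        return false; // relative link, or a host-less scheme such as "javascript:"
    }
    return strtolower($c['scheme']) === strtolower($b['scheme'])
        && strtolower($c['host']) === strtolower($b['host']);
}

// Decide before fetching, so off-site pages are never downloaded.
$base  = 'https://example.com/';
$links = array(
    'https://example.com/about',
    'https://twitter.com/xxxyyyzzz',
    'javascript:void(0)',
);
$to_crawl = array_values(array_filter($links, function ($link) use ($base) {
    return has_same_base($link, $base);
}));
print_r($to_crawl); // only https://example.com/about remains
```

Because the check runs before the request is made, the 990 unwanted links in the example above would be skipped instead of downloaded and discarded.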

Thank you for the ideas.
(the correct code is in the answer)

1 Answer

Editorial Team · Answer 1 · 2026-05-23T09:46:42+00:00

MODIFIED

Try this on for size

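// Recursively crawl $url up to $depth levels, following only links on the same scheme and host.
function crawl_page($url, $depth = 2) {
    // URLs already visited in this run; static, so it is shared across recursive calls.
    static $seen = array();
    if (isset($seen[$url]) || $depth == 0) {
        return;
    }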
    
    $seen[$url] = true;
    $parts = parse_url($url);
    $dom = new DOMDocument('1.0');
    if (!$parts || !@$dom->loadHTMLFile($url)) {
        return;
    }
    
    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $anchor) {
        $href = $anchor->getAttribute('href');
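        $path = false;
        // Relative link: rebuild an absolute URL on the current host.
        if (0 !== strpos($href, 'http') && 0 !== strpos($href, 'javascript:')) {
            // Note: every relative href is treated as root-relative here.
            $path = '/' . ltrim($href, '/');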
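            // Check for the function itself: http_build_url() comes from pecl_http 1.x
            // and is not provided by later versions of the extension.
            if (function_exists('http_build_url')) {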
                $path = http_build_url($url, array('path' => $path));
            }
            else {
                $href = "{$parts['scheme']}://";
                if (isset($parts['user'])) {
                    $href .= $parts['user'];
                    if (isset($parts['pass'])) {
                        $href .= ":{$parts['pass']}";
                    }
                    $href .= '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $path = $href . $path;
            }
        }
        else {
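            // Absolute link: follow it only when scheme and host match the page being crawled.
            // The isset() guard skips hrefs without a host, such as "javascript:" links.
            $href_parts = parse_url($href);
            if (is_array($href_parts) && isset($href_parts['host'], $href_parts['scheme'])
                && $href_parts['host'] == $parts['host'] && $href_parts['scheme'] == $parts['scheme']) {
                $path = $href;
            }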
        }
        if (!empty($path) && $depth - 1 != 0) {
            crawl_page($path, $depth - 1);
        }
    }
    echo "Crawled: {$url}\n";
}
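
The code above treats every relative href as root-relative, so a link to page2.html found on /dir/page1.html would be crawled as /page2.html. If that matters for the site being crawled, a document-relative resolution could look roughly like the sketch below. `resolve_relative` is a hypothetical helper, not part of the answer or of PHP; it assumes the page URL is an absolute http(s) URL and leaves out user, password, query and fragment handling for brevity.

```php
<?php
// Hypothetical helper: resolve a relative $href against the absolute page URL $page.
// Assumption: $page has a scheme and host; its query string and fragment are ignored.
function resolve_relative(string $href, string $page): string
{
    $p = parse_url($page);
    $root = $p['scheme'] . '://' . $p['host'];
    if (isset($p['port'])) {
        $root .= ':' . $p['port'];
    }
    $page_path = isset($p['path']) ? $p['path'] : '/';

    if ($href !== '' && $href[0] === '/') {
        $path = $href; // root-relative link
    } else {
        // Document-relative link: keep the page's directory, drop its file name.
        $dir  = substr($page_path, 0, strrpos($page_path, '/') + 1);
        $path = $dir . $href;
    }

    // Collapse "." and ".." segments without climbing above the site root.
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    return $root . implode('/', $out);
}

echo resolve_relative('page2.html', 'https://example.com/dir/page1.html'), "\n";
echo resolve_relative('../a/b', 'https://example.com/x/y/z.html'), "\n";
```

The result stays on the page's own host by construction, so it can be passed to crawl_page() in place of the root-relative path.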
