I am currently using PHPCrawler for some search functionality on a site. I need

Question

Asked: June 18, 2026 (2026-06-18T03:08:58+00:00)

I am currently using PHPCrawler for some search functionality on a site. I need to exclude some of the page elements from being indexed.

For example, I have used:

$doc_body = preg_replace('/<li>(.*?)<\/li>/is', "", $doc_body);

to remove lists, because I don’t want the lists in the results. This works exactly as it should.

Now, another thing I need to remove is the following:

<div class="example">all contents within</div>

so for this I have tried:

   $doc_body = preg_replace('/<div(.*?)class="(.*?)example(.*?)"(.*?)>(.*?)<\/div>/is', "", $doc_body);

This produces an error, perhaps because not every page has a div with the class example.
So I have adapted it with the following code:

      if(strpos($doc_body,'<div class="example">')){
      $doc_body = preg_replace('/<div(.*?)class="(.*?)example(.*?)"(.*?)>(.*?)<\/div>/is', "", $doc_body);
      }

That unfortunately does not work either! It doesn’t produce an error, but it doesn’t remove the div and all of its contents from the results.

This is my first time working with either PHPCrawler or DOMDocument, although I am not sure whether my problem here has anything to do with them.

1 Answer

Editorial Team · Answer 1 · 2026-06-18T03:08:59+00:00

I’d suggest taking a look at DOMDocument and XPath. XPath is used to query the document model much like CSS selectors are, but with a somewhat different syntax. W3Schools has a lightweight tutorial on XPath.

Regular expressions are generally a bad idea for parsing an entire document: they are both resource-heavy and time-consuming, and they break easily on nested markup.
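To illustrate the nesting problem, here is a minimal sketch using made-up sample markup (the `$html` string and the simplified pattern are assumptions, not the asker's real pages). It also checks the return value, since `preg_replace()` returns `null` on failure, for instance when the PCRE backtrack limit is exceeded on a large page.

```php
<?php
// Hypothetical sample: the "example" div contains a nested div.
$html = '<p>keep</p><div class="example"><div>inner</div> tail</div><p>also keep</p>';

// The lazy (.*?) stops at the FIRST closing </div>, which belongs to the inner div.
$result = preg_replace(
    '/<div[^>]*class="[^"]*example[^"]*"[^>]*>(.*?)<\/div>/is',
    '',
    $html
);

// preg_replace() returns null on error, so check before overwriting the source string.
if ($result === null) {
    echo 'Regex failed, preg_last_error() code: ' . preg_last_error() . PHP_EOL;
} else {
    // Part of the example div is left behind, together with a stray </div>.
    echo $result . PHP_EOL; // <p>keep</p> tail</div><p>also keep</p>
}
```

The leftover ` tail</div>` shows why a pattern that works for flat elements such as `<li>` can silently corrupt pages where the target element contains other elements of the same type.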

For example, to find every div with the class “example” using XPath, you would query the document like this:

//div[@class="example"]
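Note that `@class="example"` compares the whole attribute value, so a div with several classes would not match. The sketch below, on made-up markup, contrasts that exact match with a common XPath 1.0 idiom for matching a single class token; the sample `$html` is an assumption for illustration only.

```php
<?php
// Hypothetical markup: exact class, multi-class, and a similarly named class.
$html = '<div class="example">a</div>'
      . '<div class="box example wide">b</div>'
      . '<div class="examples">c</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Exact match: the class attribute must be exactly "example".
$exact = $xpath->query('//div[@class="example"]');

// Token match: pad the attribute with spaces and look for " example ".
$token = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " example ")]'
);

echo $exact->length, PHP_EOL; // 1
echo $token->length, PHP_EOL; // 2
```

The token form matches the first two divs but not `class="examples"`, which is closer to what the original regular expression was trying to express.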

Then remove the nodes with the DOMDocument API and finally normalize the document to get the final result.

$doc = new DOMDocument();
$doc->loadHTML($html);

// Create the XPath object after the document is loaded, so it queries the loaded markup
$xpath = new DOMXPath($doc);

// Remove all the lists
foreach ($xpath->query("//ul | //ol") as $node) {
     $node->parentNode->removeChild($node);
}

// Remove all <div class="example"> nodes, including everything inside them
foreach ($xpath->query("//div[@class='example']") as $node) {
     $node->parentNode->removeChild($node);
}

$doc->normalize();

// Get the final document for indexing
$html = $doc->saveHTML();
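The same steps can be wrapped in a reusable function. This is only a sketch under some assumptions: `strip_nodes` is a hypothetical helper name, the sample markup is made up, PHP 7 or later is assumed for the type declarations, and `libxml_use_internal_errors()` is used because crawled pages are often not perfectly valid HTML and would otherwise trigger parser warnings.

```php
<?php
// Hypothetical helper: removes every node matched by the given XPath queries.
function strip_nodes(string $html, array $queries): string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Build the XPath object only after the document has been loaded.
    $xpath = new DOMXPath($doc);
    foreach ($queries as $query) {
        foreach ($xpath->query($query) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    return $doc->saveHTML();
}

// Made-up page: a list, an "example" div with a nested div, and an unrelated div.
$html = '<html><body><p>keep</p><ul><li>x</li></ul>'
      . '<div class="example">drop <div>nested</div></div>'
      . '<div class="examples">stay</div></body></html>';

$clean = strip_nodes($html, ['//ul | //ol', "//div[@class='example']"]);
echo $clean;
```

Because whole nodes are removed from the tree, the nested div disappears together with its parent, and pages that contain no matching nodes simply pass through unchanged.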
