I would like to run a str_replace or preg_replace which looks for certain words

Question

Asked: May 31, 2026 (2026-05-31T01:11:07+00:00)

I would like to run a str_replace or preg_replace which looks for certain words (found in $glossary_terms) in my $content and replaces them with links (like <a href="/glossary/initial/term">term</a>).

However, the $content is full HTML and my links/images are being affected too, which isn’t what I’m after.

An example of $content is:

<div id="attachment_542" class="wp-caption alignleft" style="width: 135px"><a href="http://www.seriouslyfish.com/dev/wp-content/uploads/2011/12/Amazonas-English-1.jpg"><img class="size-thumbnail wp-image-542" title="Amazonas English" src="http://www.seriouslyfish.com/dev/wp-content/uploads/2011/12/Amazonas-English-1-288x381.jpg" alt="Amazonas English" width="125" height="165" /></a><p class="wp-caption-text">Amazonas Magazine - now in English!</p></div>
<p>Edited by Hans-Georg Evers, the magazine &#8216;Amazonas&#8217; has been widely-regarded as among the finest regular publications in the hobby since its launch in 2005, an impressive achievment considering it&#8217;s only been published in German to date. The long-awaited English version is just about to launch, and we think a subscription should be top of any serious fishkeeper&#8217;s Xmas list&#8230;</p>
<p>The magazine is published in a bi-monthly basis and the English version launches with the January/February 2012 issue with distributors already organised in the United States, Canada, the United Kingdom, South Africa, Australia, and New Zealand. There are also mobile apps availablen which allow digital subscribers to read on portable devices.</p>
<p>It&#8217;s fair to say that there currently exists no better publication for dedicated hobbyists with each issue featuring cutting-edge articles on fishes, invertebrates, aquatic plants, field trips to tropical destinations plus the latest in husbandry and breeding breakthroughs by expert aquarists, all accompanied by excellent photography throughout.</p>
<p>U.S. residents can subscribe to the printed edition for just $29 USD per year, which also includes a free digital subscription, with the same offer available to Canadian readers for $41 USD or overseas subscribers for $49 USD. Please see the <a href="http://www.amazonasmagazine.com/">Amazonas website</a> for further information and a sample digital issue!</p>
<p>Alternatively, subscribe directly to the print version <a href="https://www.amazonascustomerservice.com/subscribe/index2.php">here</a> or digital version <a href="https://www.amazonascustomerservice.com/subscribe/digital.php">here</a>. Just gonna add this to the end of the post so I can do some testing.</p>

I came across this link, but I wasn’t sure if such a method would work with nested HTML.

Is there any way I can str_replace or preg_replace content within <p> tags only; excluding any nested <a>, <img> or <h1/2/3/4/5> tags?

Thanks in advance,

1 Answer

Editorial Team · Answer 1 · 2026-05-31T01:11:08+00:00

A “by-the-book” solution would look like this:

<?php

$html = "<your HTML string>";
$glossary_terms = array('fishes', 'invertebrates', 'aquatic plants');

$dom = new DOMDocument;
$dom->loadHTML($html);

dom_link_glossary($dom, $glossary_terms);

echo $dom->saveHTML();

// wraps all occurrences of the glossary terms in links
function dom_link_glossary(&$document, &$glossary) {
  $xpath   = new DOMXPath($document);
  $urls    = array();
  $pattern = array();

  // build a normalized lookup (case-insensitive, whitespace-agnostic)
  foreach ($glossary as $term) {
    $term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
    // pass the delimiter so a "/" inside a term cannot break the pattern
    $pattern[] = preg_replace('/ /', '\\s+', preg_quote($term_norm, '/'));
    $urls[$term_norm] = '/glossary/initial/' . rawurlencode($term);
  }

  $pattern  = '/\b(' . implode('|', $pattern) . ')\b/i';
  $text_nodes = $xpath->query('//text()[not(ancestor::a)]');

  foreach($text_nodes as $original_node) {
    $text     = $original_node->nodeValue;
    $hitcount = preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    if ($hitcount == 0) continue;

    $offset   = 0;
    $parent   = $original_node->parentNode;
    $refnode  = $original_node->nextSibling;

    $parent->removeChild($original_node);

    foreach ($matches[0] as $i => $match) {
      $term_txt = $match[0];
      $term_pos = $match[1];
      $term_norm = preg_replace('/\s+/', ' ', strtoupper($term_txt));

      // insert any text before the term instance
      $prefix = substr($text, $offset, $term_pos - $offset);
      $parent->insertBefore($document->createTextNode($prefix), $refnode);

      // insert the actual term instance as a link
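      // createElement() does not escape its value argument, so add the text as a node
      $link = $document->createElement("a");
      $link->appendChild($document->createTextNode($term_txt));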
      $link->setAttribute("href", $urls[$term_norm]);
      $parent->insertBefore($link, $refnode);

      $offset = $term_pos + strlen($term_txt);

      if ($i == $hitcount - 1) {  // last match, append remaining text
        $suffix = substr($text, $offset);
        $parent->insertBefore($document->createTextNode($suffix), $refnode);
      }
    }
  }
}
?>

Here is how dom_link_glossary() works:

1. It normalizes the glossary terms (trim, uppercase, whitespace) and builds a lookup array and a regex pattern that matches all terms.
2. It uses XPath to find all text nodes that are not already part of a link. Text nodes are returned irrespective of their nesting depth, so no recursion is necessary on our part. I use \b in the pattern to prevent partial matches.
3. For each text node that contains terms:
   - The original text node is removed ($parent->removeChild()).
   - New nodes are created and inserted into the DOM: text nodes for anything before (or after) a glossary term, and element nodes (<a>) for the glossary terms themselves.
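The question also asked to leave headings alone. As a minimal sketch of the XPath step, the ancestor test can be widened to skip h1 to h5 as well as links. The sample markup and the extended query are my own assumptions for illustration; the function above only excludes <a>.

```php
<?php
// Hypothetical sample markup, not taken from the question.
$html = '<h2>Aquatic plants</h2><p>Keep <a href="/x">aquatic plants</a> and fishes.</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select text nodes that are neither inside a link nor inside an h1..h5 heading.
$query = '//text()[not(ancestor::a or ancestor::h1 or ancestor::h2 '
       . 'or ancestor::h3 or ancestor::h4 or ancestor::h5)]';

$texts = array();
foreach ($xpath->query($query) as $node) {
    $texts[] = $node->nodeValue;
}

// Only the paragraph text outside the link remains.
print_r($texts);
```

Swapping this query into dom_link_glossary() in place of '//text()[not(ancestor::a)]' should be enough to protect headings, since the rest of the function only works on whatever nodes the query returns.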

The solution preserves the original case and whitespace of the matched text. With lowercase glossary terms, that means:

- term will become <a href="/glossary/initial/term">term</a>
- Term will become <a href="/glossary/initial/term">Term</a>
- Foo Bar will become <a href="/glossary/initial/foo%20bar">Foo Bar</a>

Surplus whitespace or line breaks in the HTML will not break the mechanism.
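To see why case and whitespace differences do not matter, here is a small sketch of the normalization on its own. The helper name normalize_term() and the sample sentence are hypothetical; the logic mirrors what dom_link_glossary() does inline.

```php
<?php
// Hypothetical helper: trim, uppercase, and collapse runs of whitespace to one space.
function normalize_term($text) {
    return preg_replace('/\s+/', ' ', strtoupper(trim($text)));
}

$term = 'aquatic plants';
$key  = normalize_term($term); // "AQUATIC PLANTS"

// Each space in the key becomes \s+ so line breaks and extra spaces still match.
$pattern = '/\b(' . str_replace(' ', '\s+', preg_quote($key, '/')) . ')\b/i';

$text = "Healthy Aquatic\n   Plants need light.";
preg_match($pattern, $text, $m);

// The match keeps its original case and line break ...
var_dump($m[0]);
// ... but normalizes to the same lookup key as the glossary term.
var_dump(normalize_term($m[0]) === $key);
```

The matched text goes into the link unchanged, while the normalized key is only used to look up the URL.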

Note that it is perfectly fine to use regex on the plain-text values of text nodes. It is not okay to use regex on full HTML, because a pattern cannot reliably tell markup and attribute values apart from visible text.
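A short sketch of the difference, using a made-up snippet in the spirit of the question's markup (the image, its alt text, and the URL are assumptions):

```php
<?php
// Hypothetical snippet: the term appears both in an attribute and in visible text.
$html = '<p><img alt="fishes" src="f.jpg"> Many fishes.</p>';

// A naive replacement on the raw HTML also rewrites the alt attribute.
$naive = str_replace('fishes', '<a href="/glossary/f/fishes">fishes</a>', $html);
echo $naive, "\n";

// Attribute values are not text nodes, so a text-node query sees only visible text.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$texts = array();
foreach ($xpath->query('//text()') as $node) {
    $texts[] = $node->nodeValue;
}
print_r($texts);
```

The first output contains a link inside the alt attribute, which is the kind of damage described in the question. The text-node query never sees the attribute at all.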

I would recommend pairing the glossary terms with their respective URLs in an array, instead of calculating the URLs in the function. That way you can make multiple terms point to the same URL.
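As a rough sketch of that suggestion, the glossary could be an associative array of term => URL. The terms, the URLs, and the glossary_url() helper below are all hypothetical; only the normalization is carried over from the function above.

```php
<?php
// Hypothetical glossary: keys are terms, values are target URLs.
$glossary = array(
    'fish'           => '/glossary/f/fish',
    'fishes'         => '/glossary/f/fish', // plural points at the same entry
    'aquatic plants' => '/glossary/a/aquatic-plants',
);

$urls         = array();
$alternatives = array();

foreach ($glossary as $term => $url) {
    $key            = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
    $urls[$key]     = $url;
    $alternatives[] = str_replace(' ', '\s+', preg_quote($key, '/'));
}

$pattern = '/\b(' . implode('|', $alternatives) . ')\b/i';

// Hypothetical helper: resolve a piece of matched text to its URL, or null.
function glossary_url($matched, $urls) {
    $key = preg_replace('/\s+/', ' ', strtoupper($matched));
    return isset($urls[$key]) ? $urls[$key] : null;
}

preg_match($pattern, 'Two Fishes swim past.', $m);
echo $m[0], ' => ', glossary_url($m[0], $urls), "\n";
```

Inside dom_link_glossary(), the only change needed would be to fill $urls from the array values instead of building them with rawurlencode().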
