In a few different guises I’ve asked about this “filter” on here and WPSE. I’m now taking a different approach to it, and I’d like to make it solid and reliable.
My situation:
- When I create a post in my WordPress CMS, I want to run a filter which searches for certain terms and replaces them with links.
- I have the terms that I want to search for in two arrays: $glossary_terms and $species_terms.
- $species_terms is a list of scientific names of fishes, such as Apistogramma panduro.
- $glossary_terms is a list of fishkeeping glossary terms such as abdomen, caudal-fin and Gram's Method.
There are a few nuances worth noting:
- Speed is not an issue, as I will be running this filter in the background rather than when a user visits the page or when an author submits/edits a species profile or post.
- Some of the post content being filtered may contain HTML with these terms in, like <img src="image.jpg" title="Apistogramma panduro male" />. Obviously these shouldn’t be replaced.
- Species are often referred to with an abbreviated genus, so instead of Apistogramma panduro, you’ll often see A. panduro. This means I need to search & replace all of the species terms as an abbreviation too: Apistogramma panduro, A. panduro, Satanoperca daemon, S. daemon etc.
- If caudal-fin and caudal both exist in the glossary terms, caudal-fin should be replaced first.
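One way the "caudal-fin before caudal" requirement could be met is to sort the terms longest-first before building the search pattern, since a regexp alternation tries its branches from left to right. A minimal sketch, assuming the glossary array is keyed by ID as in the examples further down (the sample sentence is made up for illustration):

```php
<?php
// Sample glossary terms keyed by ID, mirroring the question's $glossary_terms.
$glossary_terms = [
    1 => 'abdomen',
    2 => 'caudal',
    3 => 'caudal-fin',
    4 => 'caudal-fin peduncle',
];

// uasort() keeps the ID keys; longer terms are moved to the front.
uasort($glossary_terms, function ($a, $b) {
    return strlen($b) - strlen($a);
});

// Escape each term so characters like "." or "-" are matched literally.
$branches = array_map(function ($term) {
    return preg_quote($term, '/');
}, $glossary_terms);

// Branches are tried left to right, so the longest term wins at any position.
$pattern = '/\b(?:' . implode('|', $branches) . ')\b/i';

preg_match_all($pattern, 'The caudal-fin peduncle and caudal area.', $m);
print_r($m[0]);
```

Here the first match is the full "caudal-fin peduncle" rather than "caudal", and the later standalone "caudal" is still found on its own.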
I was contemplating simply adding a preg_replace which searched for the terms, but only with a space on the left, (i.e. ( )term) and a space, comma, exclamation, full-stop or hyphen on the right (i.e. term(, . ! - )) but that won’t help me to not break the image HTML.
Example content
<br />
It looks very similar to fishes of the <i><a href="species/betta-foerschi" rel="species/betta-foerschi/?hover=true" class="link_species">B. foerschi</a></i> group/complex but its breeding strategy, adult size and observed behaviour preclude its inclusion in that <a href="glossary/a/assemblage" rel="glossary/a/assemblage?hover=true" class="link_glossary">assemblage</a>.
Instead it appears to be a member of the <i><a href="species/betta-coccina" rel="species/betta-coccina/?hover=true" class="link_species">B. coccina</a></i> group which currently includes <i><a href="species/betta-brownorum" rel="species/betta-brownorum/?hover=true" class="link_species">B. brownorum</a></i>, <i><a href="species/betta-burdigala" rel="species/betta-burdigala/?hover=true" class="link_species">B. burdigala</a></i>, <i><a href="species/betta-coccina" rel="species/betta-coccina/?hover=true" class="link_species">B. coccina</a></i>, <i><a href="species/betta-livida" rel="species/betta-livida/?hover=true" class="link_species">B. livida</a></i>, <i>B. miniopinna</i>, <i><a href="species/betta-persephone" rel="species/betta-persephone/?hover=true" class="link_species">B. persephone</a></i>, <i>B. tussyae</i>, <i><a href="species/betta-rutilans" rel="species/betta-rutilans/?hover=true" class="link_species">B. rutilans</a></i> and <i><a href="species/betta-uberis" rel="species/betta-uberis/?hover=true" class="link_species">B. uberis</a></i>.
Of these it's most similar in appearance to <i><a href="species/betta-uberis" rel="species/betta-uberis/?hover=true" class="link_species">B. uberis</a></i> but can be distinguished by its noticeably shorter <a href="glossary/d/dorsal" rel="glossary/d/dorsal?hover=true" class="link_glossary">dorsal</a>-<a href="glossary/f/fin" rel="glossary/f/fin?hover=true" class="link_glossary">fin</a> <a href="glossary/b/base" rel="glossary/b/base?hover=true" class="link_glossary">base</a> and overall blue-greenish (vs. green/reddish) colouration.
Members of this group are characterised by their small adult size (< 40 mm SL), a uniform red or black <a href="glossary/b/base" rel="glossary/b/base?hover=true" class="link_glossary">base</a> body colour, the presence of a <a href="glossary/m/midlateral" rel="glossary/m/midlateral?hover=true" class="link_glossary">midlateral</a> body blotch in some <a href="glossary/s/species" rel="glossary/s/species?hover=true" class="link_glossary">species</a> and the fact they have 9 abdominal <a href="glossary/v/vertebrae" rel="glossary/v/vertebrae?hover=true" class="link_glossary">vertebrae</a> compared with 10-12 in the other <a href="glossary/s/species" rel="glossary/s/species?hover=true" class="link_glossary">species</a> groups. In addition all are <a href="glossary/o/obligate" rel="glossary/o/obligate?hover=true" class="link_glossary">obligate</a> <a href="glossary/p/peat" rel="glossary/p/peat?hover=true" class="link_glossary">peat</a> <a href="glossary/s/swamp" rel="glossary/s/swamp?hover=true" class="link_glossary">swamp</a> dwellers (Tan and Ng, 2005).<br />
^^^ This example here has had the correct links manually inserted. The filter shouldn’t break these links!
It looks very similar to fishes of the B. foerschi group/complex but its breeding strategy, adult size and observed behaviour preclude its inclusion in that assemblage.
Instead it appears to be a member of the B. coccina group which currently includes B. brownorum, B. burdigala, B. coccina, B. livida, B. miniopinna, B. persephone, B. tussyae, B. rutilans and B. uberis.
Of these it's most similar in appearance to B. uberis but can be distinguished by its noticeably shorter dorsal-fin base and overall blue-greenish (vs. green/reddish) colouration.
Members of this group are characterised by their small adult size (< 40 mm SL), a uniform red or black base body colour, the presence of a midlateral body blotch in some species and the fact they have 9 abdominal vertebrae compared with 10-12 in the other species groups. In addition all are obligate peat swamp dwellers (Tan and Ng, 2005).
^^^ Same example pre-formatting.
<a href="http://www.seriouslyfish.comwp-content/uploads/2011/12/Amazonas-English-1.jpg"><img class="size-thumbnail wp-image-542" title="Amazonas English" src="/wp-content/uploads/2011/12/Amazonas-English-1-288x381.jpg" alt="Amazonas English" width="125" height="165" /></a>Amazonas Magazine - now in English!
Edited by Hans-Georg Evers, the magazine 'Amazonas' has been widely-regarded as among the finest regular publications in the hobby since its launch in 2005, an impressive achievment considering it's only been published in German to date. The long-awaited English version is just about to launch, and we think a subscription should be top of any serious fishkeeper's Xmas list...
The magazine is published in a bi-monthly basis and the English version launches with the January/February 2012 issue with distributors already organised in the United States, Canada, the United Kingdom, South Africa, Australia, and New Zealand. There are also mobile apps availablen which allow digital subscribers to read on portable devices.
It's fair to say that there currently exists no better publication for dedicated hobbyists with each issue featuring cutting-edge articles on fishes, invertebrates, aquatic plants, field trips to tropical destinations plus the latest in husbandry and breeding breakthroughs by expert aquarists, all accompanied by excellent photography throughout.
U.S. residents can subscribe to the printed edition for just $29 USD per year, which also includes a free digital subscription, with the same offer available to Canadian readers for $41 USD or overseas subscribers for $49 USD. Please see the <a href="http://www.amazonasmagazine.com/">Amazonas website</a> for further information and a sample digital issue!
Alternatively, subscribe directly to the print version <a href="https://www.amazonascustomerservice.com/subscribe/index2.php">here</a> or digital version <a href="https://www.amazonascustomerservice.com/subscribe/digital.php">here</a>.
^^^ This will likely only have a few Glossary terms in rather than any species links.
Example terms
$species_terms
339 => 'Aulonocara maylandi maylandi',
340 => 'Aulonocara maylandi kandeensis',
341 => 'Aulonocara sp. "walteri"',
342 => 'Aulonocara sp. "stuartgranti maleri"',
343 => 'Aulonocara stuartgranti',
344 => 'Benthochromis tricoti',
345 => 'Boulengerochromis microlepis',
346 => 'Buccochromis lepturus',
347 => 'Buccochromis nototaenia',
348 => 'Betta brownorum',
349 => 'Betta foerschi',
350 => 'Betta coccina',
351 => 'Betta uberis'
As you can see above, the general format for these scientific names is “Genus species”, but can often include “sp.” or “aff.” (for species which aren’t officially described) and “Genus species subspecies” formats.
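To illustrate how these name formats could be matched together with their abbreviated-genus form, here is a small sketch. The helper name speciesFragment is hypothetical, and it assumes every name consists of a genus followed by at least one more word. Lookarounds are used instead of \b so that names ending in a quote, such as Aulonocara sp. "walteri", still match.

```php
<?php
/**
 * Build a regexp fragment that matches a scientific name with the genus
 * either written in full or abbreviated to its initial plus a full stop.
 * Hypothetical helper; assumes "Genus rest-of-name" with at least one space.
 */
function speciesFragment(string $name): string
{
    // Split at the first space only: genus, then everything else.
    [$genus, $rest] = explode(' ', $name, 2);
    $initial = preg_quote(substr($genus, 0, 1), '/');
    $tail    = preg_quote(substr($genus, 1), '/');
    // (?:...) groups without capturing, so group numbering is unaffected.
    return $initial . '(?:' . $tail . '|\.) ' . preg_quote($rest, '/');
}

// Wrap a fragment so it only matches when not glued to other word characters.
function speciesPattern(string $name): string
{
    return '/(?<!\w)' . speciesFragment($name) . '(?!\w)/';
}

$subspecies  = speciesPattern('Aulonocara maylandi kandeensis');
$undescribed = speciesPattern('Aulonocara sp. "walteri"');

var_dump(preg_match($subspecies, 'A pair of A. maylandi kandeensis'));
var_dump(preg_match($subspecies, 'Aulonocara stuartgranti'));
var_dump(preg_match($undescribed, 'An A. sp. "walteri" male'));
```

The first and third calls report a match and the second does not, because the species epithet differs.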
$glossary_terms
1 => 'abdomen',
2 => 'caudal',
3 => 'caudal-fin',
4 => 'caudal-fin peduncle',
5 => 'Gram\'s Method'
If anyone can come up with a filter which meets all these conditions and requirements, I’d like to offer a bounty.
Thanks in advance,
I think it’s better to use DOMDocument functionality than regexps. Here is a working prototype:
Implementation details
I’ve only shown how to replace species terms; glossary terms work the same way. Links take the form “species/$id”. Abbreviations are handled correctly.
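The general shape of such a DOM-based filter can be sketched as follows. This is only an illustration under stated assumptions, not the prototype referred to above: the function name linkSpecies is hypothetical, terms are assumed to be plain "Genus species" names keyed by ID, and text already inside a link is skipped.

```php
<?php
function linkSpecies(string $html, array $speciesTerms): string
{
    // One capturing group per term; group N belongs to the Nth term ID.
    $ids = array_keys($speciesTerms);
    $groups = [];
    foreach ($speciesTerms as $name) {
        [$genus, $rest] = explode(' ', $name, 2);
        $groups[] = '(' . preg_quote($genus[0], '/')
            . '(?:' . preg_quote(substr($genus, 1), '/') . '|\.) '
            . preg_quote($rest, '/') . ')';
    }
    $pattern = '/(?<!\w)(?:' . implode('|', $groups) . ')(?!\w)/u';

    libxml_use_internal_errors(true); // tolerate imperfect markup quietly
    $doc = new DOMDocument();
    // The XML prolog is a common workaround to make loadHTML() read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><body>' . $html . '</body>');
    libxml_clear_errors();

    // Only text nodes are visited, so attribute values are never touched.
    $xpath = new DOMXPath($doc);
    $nodes = iterator_to_array($xpath->query('//body//text()[not(ancestor::a)]'));
    foreach ($nodes as $node) {
        $text = $node->nodeValue;
        $flags = PREG_SET_ORDER | PREG_OFFSET_CAPTURE;
        if (!preg_match_all($pattern, $text, $matches, $flags)) {
            continue;
        }
        $fragment = $doc->createDocumentFragment();
        $pos = 0;
        foreach ($matches as $m) {
            [$found, $offset] = $m[0];
            // Find which group matched; unmatched groups have offset -1.
            $id = null;
            for ($g = 1; $g < count($m); $g++) {
                if ($m[$g][1] !== -1) {
                    $id = $ids[$g - 1];
                    break;
                }
            }
            $fragment->appendChild($doc->createTextNode(substr($text, $pos, $offset - $pos)));
            $link = $doc->createElement('a');
            $link->setAttribute('href', 'species/' . $id);
            $link->appendChild($doc->createTextNode($found));
            $fragment->appendChild($link);
            $pos = $offset + strlen($found); // byte offsets throughout
        }
        $fragment->appendChild($doc->createTextNode(substr($text, $pos)));
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialise only the children of <body>, dropping the wrapper again.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$speciesTerms = [348 => 'Betta brownorum', 351 => 'Betta uberis'];
$html = '<img src="image.jpg" title="Betta uberis male" /> <i>B. uberis</i> '
      . 'resembles <a href="species/348">B. brownorum</a>.';
echo linkSpecies($html, $speciesTerms), "\n";
```

Because only text nodes are rewritten, the title attribute of the image stays as it is, and the existing link is left alone because its text has an a element as an ancestor.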
- DOMDocument is a very reliable parser; it can deal with broken markup and is fast.
- ?: in a regexp prevents a subpattern from being counted as a capturing group (see the documentation on subpatterns). Without proper counting of subpatterns, we can’t retrieve the termId.
The idea is that we build one big regexp pattern by joining all the regexps specified in the $speciesTerms array, separated by a pipe |. Final regexp for the first two species would be (spaces for clarity):
So, the text “Examples: Aulonocara maylandi maylandi, A. maylandi kandeensis” will give the following matches:
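As a rough sketch of how such a combined pattern might be built and what preg_match_all returns for that text: the two $speciesTerms entries below are assumptions about what the per-term regexps look like, written with a non-capturing group for the abbreviated genus.

```php
<?php
// Assumed per-term regexps, keyed by term ID.
$speciesTerms = [
    339 => 'A(?:ulonocara|\.) maylandi maylandi',
    340 => 'A(?:ulonocara|\.) maylandi kandeensis',
];

// Wrap each term in its own capturing group and join the groups with a pipe.
$groups = array_map(function ($regexp) {
    return '(' . $regexp . ')';
}, $speciesTerms);
$pattern = '/' . implode('|', $groups) . '/';

$text = 'Examples: Aulonocara maylandi maylandi, A. maylandi kandeensis';

// Default flag PREG_PATTERN_ORDER: $matches[N] collects group N of every match,
// with an empty string where that group did not take part.
preg_match_all($pattern, $text, $matches);
print_r($matches);

// Group N corresponds to the Nth key of $speciesTerms.
$ids = array_keys($speciesTerms);
```

With these inputs, $matches[0] holds both full matches, $matches[1] holds "Aulonocara maylandi maylandi" followed by an empty string, and $matches[2] holds an empty string followed by "A. maylandi kandeensis". If the inner groups were capturing, the numbering would shift and group N would no longer line up with the Nth term.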
We can clearly say that all elements in matches[1] refer to the species Aulonocara maylandi maylandi or A. maylandi maylandi, which has id = 339.
In short: use (?:) if you’re using subpatterns in $speciesTerms.
UPDATE
Each dynamically created regexp is limited to a maximum number of subpatterns, which is defined as a const at the top. This avoids hitting the PCRE limit on the number of subpatterns in a single regexp.
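The chunking could look roughly like this. The constant name MAX_SUBPATTERNS, its deliberately tiny value and the helper buildPatterns are all assumptions made for illustration; a real value would be much larger.

```php
<?php
// Hypothetical cap on capturing groups per generated regexp (tiny for the demo).
const MAX_SUBPATTERNS = 2;

/**
 * Split the term regexps into chunks and build one combined regexp per chunk.
 * Each entry remembers the term IDs in group order, so group N of a chunk's
 * regexp maps back to ids[N - 1].
 */
function buildPatterns(array $termRegexps, int $max = MAX_SUBPATTERNS): array
{
    $patterns = [];
    // The third argument (preserve_keys) keeps the term IDs as array keys.
    foreach (array_chunk($termRegexps, $max, true) as $chunk) {
        $groups = array_map(function ($regexp) {
            return '(' . $regexp . ')';
        }, $chunk);
        $patterns[] = [
            'ids'    => array_keys($chunk),
            'regexp' => '/' . implode('|', $groups) . '/',
        ];
    }
    return $patterns;
}

$terms = [
    339 => 'A(?:ulonocara|\.) maylandi maylandi',
    340 => 'A(?:ulonocara|\.) maylandi kandeensis',
    343 => 'A(?:ulonocara|\.) stuartgranti',
];
$patterns = buildPatterns($terms);
print_r($patterns);
```

Three terms with a cap of two give two regexps: one covering IDs 339 and 340, and one covering 343. Building this array once and reusing it avoids regenerating the patterns for every text node.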
Important notes:
- matchTerms, because a regexp has a limit on the number of subpatterns. In this case it’s optimal to prebuild an array of regexps, one for every N terms.
- matchTerms generates the regexp on every call; obviously this only needs to be done once
- speciesTerms
- strlen => mb_strlen if you’re using multibyte encodings
- $html will be wrapped in a <body> tag (unless it’s already wrapped)
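To illustrate the strlen versus mb_strlen note: strlen counts bytes while mb_strlen counts characters, so the two disagree as soon as a term contains a multibyte character. A small sketch, assuming UTF-8 text and that the mbstring extension is available; the typographic apostrophe in the term is chosen for the demonstration.

```php
<?php
// "Gram’s Method" with a typographic apostrophe (U+2019, three bytes in UTF-8).
$term = "Gram\u{2019}s Method";

$bytes = strlen($term);             // counts bytes
$chars = mb_strlen($term, 'UTF-8'); // counts characters

echo $bytes, ' bytes, ', $chars, " characters\n";
```

Keep in mind that offsets reported by the preg_* functions with PREG_OFFSET_CAPTURE are byte offsets, so byte-based and character-based functions should not be mixed when slicing the text around a match.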