I have some plain text and some HTML. I need to create a PHP method that returns the same HTML, but with <span class="marked"> before each instance of the text and </span> after it.
Note that it should cope with tags in the HTML: for example, if the text is blabla, it should still be marked when it appears as bla<b>bla</b> or <a href="http://abc.com">bla</a>bla.
It should be case-insensitive and should also support long text (spanning multiple lines, etc.).
For example, if I call this function with the text “my name is josh” and the following HTML:
<html>
<head>
<title>My Name Is Josh!!!</title>
</head>
<body>
<h1>my name is <b>josh</b></h1>
<div>
<a href="http://www.names.com">my name</a> is josh
</div>
<u>my</u> <i>name</i> <b>is</b> <span style="font-family: Tahoma;">Josh</span>.
</body>
</html>
… it should return:
<html>
<head>
<title><span class="marked">My Name Is Josh</span>!!!</title>
</head>
<body>
<h1><span class="marked">my name is <b>josh</b></span></h1>
<div>
<span class="marked"><a href="http://www.names.com">my name</a> is josh</span>
</div>
<span class="marked"><u>my</u> <i>name</i> <b>is</b> <span style="font-family: Tahoma;">Josh</span></span>.
</body>
</html>
Thanks.
This is going to be tricky.
Whilst you could do it with simple regex hacking, ignoring anything inside a tag, something like the naïve:
that’s not at all reliable. Partly that’s because HTML can’t be parsed with regex: it’s valid to put > in an attribute value, and other non-element constructs like comments will be mis-parsed. Even with a more rigorous expression to match tags (something horribly unwieldy like <[^>\s]*(\s+([^>\s]+(\s*=\s*([^"'\s>][^\s>]*|"[^"]*"|'[^']*')\s*))?)*\s*\/?>), you’d still have many of the same problems, especially if the input HTML is not guaranteed valid. This could even be a security issue: if the HTML you are processing is untrusted, it could fool your parser into turning text content into attributes, resulting in script injection.
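To make the > problem concrete, here’s a tiny sketch in JavaScript. The pattern and the sample markup are made up for illustration; it’s just the simplest ‘tag’ regex one might reach for, not a pattern anyone should actually use.

```javascript
// A deliberately naive "tag" pattern: '<', then anything that is not '>', then '>'.
const naiveTag = /<[^>]*>/g;

// Valid HTML: a literal '>' is allowed inside a quoted attribute value.
const html = '<a title="1 > 0" href="#">link</a>';

// What the pattern believes the tags are.
const tags = html.match(naiveTag);

// What it believes the text content is, once those "tags" are removed.
const text = html.replace(naiveTag, '');

console.log(tags); // the first "tag" stops at the '>' inside the title attribute
console.log(text); // the rest of the attribute list is now treated as text
```

The first ‘tag’ comes out as `<a title="1 >`, and the leftover ` 0" href="#">link` is treated as text content, so any marking done on that ‘text’ would land in the middle of the real tag.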
But even ignoring that, you wouldn’t be able to ensure proper element nesting. So you might turn:
into the misnested and invalid:
or:
where those elements can’t be wrapped with a span. If you’re unlucky, the browser fixups to ‘correct’ your invalid output could end up leaving half the page ‘marked’, or messing up the page layout.
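Here’s a minimal sketch of how that misnesting comes about, assuming a naive regex that tolerates tag-like chunks between the characters of the search text. The names naiveMark and escapeRegExp are hypothetical helpers invented for this illustration, and the sample inputs are mine.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal string.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Naive approach: allow any number of tag-like chunks between consecutive
// characters of the search text, then wrap the whole match in a span.
function naiveMark(html, text) {
  const tagish = '(?:<[^>]*>)*';
  const pattern = Array.from(text, (ch) => escapeRegExp(ch)).join(tagish);
  return html.replace(
    new RegExp(pattern, 'gi'),
    (match) => '<span class="marked">' + match + '</span>'
  );
}

// The match starts outside the <b> but ends inside it.
const inline = naiveMark('<h1>my name is <b>josh</b></h1>', 'my name is josh');

// The match straddles two paragraphs, which a span cannot contain.
const block = naiveMark('<p>my name</p><p> is josh</p>', 'my name is josh');

console.log(inline);
console.log(block);
```

In the first case the closing </span> ends up before the </b> it should enclose; in the second the span opens in one paragraph and closes in the next. Both are invalid, and the regex has no way of knowing.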
So you would have to do this on a parsed-DOM level rather than with string hacking. You could parse the whole string in using PHP, process it and re-serialise, but if it’s acceptable from an accessibility point of view, it would probably be easier to do it at the browser end in JavaScript, where the content is already parsed into DOM nodes.
It’s still going to be pretty hard. This question handles it where the text will all be inside the same text node, but that’s a much simpler case.
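For that simpler single-text-node case, the core of the job is just cutting one string into matched and unmatched pieces. A minimal sketch, with no DOM involved; splitOnMatches is a hypothetical name, and it assumes lower-casing doesn’t change the string’s length (true for ASCII, not for every Unicode character).

```javascript
// Split one text node's content into segments, flagging the ones that match.
// Matching is case-insensitive; matches are taken left to right, non-overlapping.
function splitOnMatches(text, needle) {
  const haystack = text.toLowerCase();
  const target = needle.toLowerCase();
  if (target.length === 0) return [{ text: text, marked: false }];

  const segments = [];
  let pos = 0;
  let idx;
  while ((idx = haystack.indexOf(target, pos)) !== -1) {
    // Unmatched text before this match, if any.
    if (idx > pos) segments.push({ text: text.slice(pos, idx), marked: false });
    // The match itself, keeping the original capitalisation.
    segments.push({ text: text.slice(idx, idx + target.length), marked: true });
    pos = idx + target.length;
  }
  // Whatever is left after the last match.
  if (pos < text.length) segments.push({ text: text.slice(pos), marked: false });
  return segments;
}

console.log(splitOnMatches('My Name Is Josh!!!', 'my name is josh'));
```

In a browser, each marked segment would then become a span element wrapping its own text node (for instance via Text.splitText and document.createElement), which is safe precisely because a single text node can’t contain any tags.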
What you would effectively have to do would be:
Ouch.
Here’s an alternative suggestion which is slightly less nasty, if it’s acceptable to wrap every text node that is part of a match separately. So:
would leave you with the output:
which might look OK, depending on how you’re styling the matches. It would also solve the misnesting problem of matches partially inside elements.
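One way to picture the per-text-node idea without touching the DOM: take the contents of the consecutive text nodes, search their concatenation, and map each hit back to a range inside each node. This is only a sketch under my own assumptions; matchRangesPerNode is a hypothetical name, whitespace is compared literally, and lower-casing is assumed not to change string lengths.

```javascript
// nodeTexts: contents of consecutive text nodes, in document order.
// Returns, for each node, a list of [start, end) ranges that belong to a match.
function matchRangesPerNode(nodeTexts, needle) {
  const joined = nodeTexts.join('').toLowerCase();
  const target = needle.toLowerCase();
  const ranges = nodeTexts.map(() => []);
  if (target.length === 0) return ranges;

  let from = 0;
  let idx;
  while ((idx = joined.indexOf(target, from)) !== -1) {
    const end = idx + target.length;
    let offset = 0; // position of the current node within the joined string
    nodeTexts.forEach((nodeText, i) => {
      // Intersect the match [idx, end) with this node's span in the joined string.
      const start = Math.max(idx, offset);
      const stop = Math.min(end, offset + nodeText.length);
      if (start < stop) ranges[i].push([start - offset, stop - offset]);
      offset += nodeText.length;
    });
    from = end;
  }
  return ranges;
}

// Text nodes of: <h1>my name is <b>josh</b></h1>
console.log(matchRangesPerNode(['my name is ', 'josh'], 'my name is josh'));
```

Each range would then get its own span inside its own text node, so no span ever has to cross an element boundary.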
ETA: Oh sod the pseudocode, I’ve more-or-less written the code now anyway, might as well finish it. Here’s a JavaScript version of the latter approach: