I have a SQL query in MySQL and I want an expression that matches a string only when it is not between '<' and '>'. For example:
select '<span class="boldtext">collaboratively site</span> – regardless of platform or language' rlike 'expression looking for boldtext' ==> should return false because 'boldtext' is located inside an HTML tag
select '<span class="boldtext">collaboratively site</span> – regardless of platform or language' rlike 'expression looking for platform' ==> should return true because 'platform' is located outside any HTML tag
I tried the expression below, but no luck. I guess it is because the '*' is greedy.
select '...' rlike '[^[.<.]]?[^[.>.]]*platform[^[.<.]]*[^[.>.]]?' # This expression doesn't work
I know the expression would look like the one below if it were run in a programming language like Ruby or PHP:
'<span class="boldtext">collaboratively site</span> – regardless of platform or language' =~ /((?!<[^>]*))\bboldtext\1/ # => false
'<span class="boldtext">collaboratively site</span> – regardless of platform or language' =~ /((?!<[^>]*))\bplatform\1/ # => true
I found a similar post but I can’t rewrite it for my case.
Could you help me come up with an expression that matches a string only when it is not inside an HTML tag (to be run with MySQL's rlike operator)?
Unfortunately, regular expressions cannot reliably parse arbitrarily nested languages like HTML. You will want to use a proper HTML parser for this, and I doubt MySQL contains one.
You might consider, if performing this operation in the DB is absolutely critical, creating another column that will contain only the textual representation of the HTML (again, using a proper parser to remove all of the tags) and set that when inserting/modifying the HTML itself. You will obviously need to keep them in sync, and this may be a pain, but it will simplify your queries immensely.
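As a rough illustration of the text-only column idea, here is a minimal Python sketch that strips the tags with the standard library's html.parser before the value is stored. The class name TextExtractor and the helper html_to_text are made up for this example, and it assumes the fragment contains no script or style blocks (their contents would be passed through as text). Any real HTML parser in your application language would do the same job.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tag names and attributes never arrive here.
        self.chunks.append(data)


def html_to_text(html):
    """Return the text content of `html` (hypothetical helper for this sketch)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return "".join(parser.chunks)


sample = ('<span class="boldtext">collaboratively site</span>'
          ' – regardless of platform or language')
plain = html_to_text(sample)

print(plain)                # collaboratively site – regardless of platform or language
print("boldtext" in plain)  # False: it only appeared inside the tag
print("platform" in plain)  # True: it is part of the visible text
```

If the result is saved in a second column next to the HTML (say body_text, a hypothetical name), the query side no longer needs any tag-aware expression: a plain rlike 'platform' against that column is enough.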