I am trying to make a so-called text cleaner so that I can get rid of a few HTML elements without using the strip_tags() function.
My regex looks like this: <em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>| |<table[^>]*>(.*?)</table[^>]*>
My code looks like this:
$string = "some very messy string here ";
$pattern = '<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>| |<table[^>]*>(.*?)</table[^>]*>';
$replace = ' ';
$clean = preg_replace($pattern, $replace, $string);
echo $clean;
For reasons that are beyond my understanding the echo returns nothing.
Thank you for your time
UPDATE #1
If you are asking if I want to get rid of the tables with all the content inside them the answer is yes.
Your regular expression needs delimiters. For example:
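A minimal sketch of the same pattern wrapped in delimiters. The sample input is an assumption, since the real string was not shown in the question. The tilde is used as the delimiter here only because the pattern contains forward slashes, which would otherwise need escaping. Without explicit delimiters PHP most likely treats the leading < and the first > as a bracket-style delimiter pair and rejects the rest as unknown modifiers. preg_replace() then returns null, which echoes as an empty string.

```php
<?php
// Hypothetical input for illustration; the original string was not given.
$string = '<p class="intro">Hello <em>world</em></p><div>more</div>';

// The question's alternation, unchanged, wrapped in ~ ... ~ delimiters.
$pattern = '~<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>| |<table[^>]*>(.*?)</table[^>]*>~';
$replace = ' ';

$clean = preg_replace($pattern, $replace, $string);

if ($clean === null) {
    // preg_replace() returns null when the pattern is invalid or matching fails.
    echo 'Regex error code: ' . preg_last_error();
} else {
    echo $clean;
}
```

Checking for null before echoing makes a malformed pattern visible instead of silently printing nothing.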
Read up on delimiters here.
Also note that some HTML specifications (all but XHTML, as far as I know) allow uppercase tags, too. So consider adding the modifier for case-insensitivity (i) to your regular expression. Furthermore, removing tables might not work if there are line breaks between the opening and closing tags, because the dot (.) does not match line breaks by default. Adding the DOTALL modifier (s) solves this.
One final note: as the others pointed out, regex solutions to HTML problems should be taken with a grain of salt. Nested tables will cause issues, as will comments. If you know the data you are dealing with very well, the problem might be much less complex than general HTML. But be sure your markup is at least valid and that you know about all oddities, like nested structures and HTML characters in comments.
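To illustrate the two modifiers mentioned above (i for case-insensitivity, s for DOTALL), here is a sketch that appends both after the closing delimiter. The input string is made up for the example: it uses uppercase tags and a table that spans several lines.

```php
<?php
// Hypothetical input: uppercase tags and a multi-line table.
$string = "Intro <P>text</P>\n<TABLE border=\"1\">\n<tr><td>cell</td></tr>\n</TABLE>\noutro";

// Same alternation as in the question; the trailing "is" adds:
//   i = match tags regardless of case
//   s = DOTALL, so the dot in (.*?) also matches line breaks
$pattern = '~<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>| |<table[^>]*>(.*?)</table[^>]*>~is';

$clean = preg_replace($pattern, ' ', $string);

if ($clean === null) {
    // null signals a pattern or matching error
    echo 'Regex error code: ' . preg_last_error();
} else {
    echo $clean;
}
```

Because (.*?) is lazy, the table match ends at the first closing table tag it finds. With nested tables that is the inner table's closing tag, so the rest of the outer table is left behind. This is one of the nesting issues noted above.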