In a follow up to my last question, if you have a string


Asked: June 7, 2026


In a follow-up to my last question: if an XML file contains a malformed string, you can extract the contents using preg_replace_callback() to remove the elements that break parsing.

The point of this function is not to parse the XML with regex (a bad idea), but to find XML that doesn’t parse, and where it fails, so that we can flag articles that aren’t correctly formatted before they are sent out. This is part of a set of tools to clean content before delivery. I am testing it on known malformed public RSS URLs as well as internal ones to see whether it caters for a number of situations.

The callback replaces each extracted node with an integer index. If the XML parses after that, we can report the index of the offending article, then use DOMDocument to try to correct its HTML and parse again. If that fails, we report it as critical; otherwise, we return the parsed article description and content to the database, marking it as modified before delivery.

You can then take the broken elements and run them through DOMDocument to format them better to return to the XML file.
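The check-then-repair workflow described above might be sketched as follows. This is a minimal sketch, not the actual tool: the function names xml_parse_errors() and tidy_html_fragment() are hypothetical, and it assumes the dom and libxml extensions are available.

```php
<?php
// Hypothetical helper: returns libxml's parse errors for an XML string
// (an empty array means the string is well-formed).
function xml_parse_errors(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf('line %d, column %d: %s', $error->line, $error->column, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $errors;
}

// Hypothetical helper: lets DOMDocument's lenient HTML parser close unbalanced tags
// in a fragment and returns the repaired markup.
// Caveat: loadHTML() makes its own assumptions about character encoding, so
// non-ASCII content may need extra handling that this sketch leaves out.
function tidy_html_fragment(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child); // serialize each repaired child node
    }
    return $out;
}
```

The error list gives a line and column for each failure, which is enough to flag an article; the repaired fragment can then be put back and the whole document checked again.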

However, I’m stuck on how to make the example below return something other than false:

Sample XML:

<item>
    <content:encoded><![CDATA[
        This is the text with odd characters that are killing 
        simplexml_load_string() (doesn't recover) and breaking 
        (although recoverable) DOMDocument
    ]]></content:encoded>
</item>

If I use the following PHP, I can extract a description node and convert it from:

<description><![CDATA[
    This is some description text with the same problem
]]></description>

to

<description>0</description>

PHP:

preg_replace_callback(
    '/<description>(.*)<\/description>/', // add msU modifiers to fix (see the update below)
    'node_tidy::callback_description',
    $xml
);

…

private function callback_description($matches=false) {
    if(false !== $matches) {
        $this->arrDescriptions[] = $matches[1];
        return '<description>'.$this->indexDescriptions++.'</description>';
    } else {
        return false;
    }
}

However, when I try to do the same with content:encoded nodes, it returns false. Here’s the related function:

private function callback_content_encoded($matches=false) {
    if(false !== $matches) {
        $this->arrContentEncoded[] = $matches[1];
        return '<content:encoded>'.$this->indexContentEncoded++.'</content:encoded>';
    } else {
        return false;
    }
}
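For comparison, here is a minimal, self-contained sketch of the same idea. The class name NodeTidySketch and the method extract() are hypothetical stand-ins for the node_tidy class above, whose full source is not shown. Two assumptions differ from the snippets in the question: the callback is passed as an array callable bound to the object (the methods use $this, so a static-style string such as 'node_tidy::callback_description' is unlikely to work), and the pattern carries the s and U modifiers.

```php
<?php
// Hypothetical stand-in for the node_tidy class in the question.
class NodeTidySketch
{
    public $arrContentEncoded = [];
    private $indexContentEncoded = 0;

    // Replaces every <content:encoded> node body with a running index
    // and stores the original bodies in $arrContentEncoded.
    public function extract(string $xml): string
    {
        // s: "." also matches newlines; U: quantifiers are ungreedy,
        // so each node is matched on its own rather than first-open to last-close.
        $result = preg_replace_callback(
            '/<content:encoded>(.*)<\/content:encoded>/sU',
            [$this, 'callbackContentEncoded'],
            $xml
        );

        return $result ?? $xml; // preg_replace_callback() returns null on error
    }

    private function callbackContentEncoded(array $matches): string
    {
        $this->arrContentEncoded[] = $matches[1];
        return '<content:encoded>' . $this->indexContentEncoded++ . '</content:encoded>';
    }
}

$tidy = new NodeTidySketch();
$out = $tidy->extract(
    "<item><content:encoded><![CDATA[\nline one\nline two\n]]></content:encoded></item>"
    . "<item><content:encoded>b</content:encoded></item>"
);
echo $out, PHP_EOL;
```

The callback is only invoked when the pattern matches, so the `$matches=false` default in the question's version never comes into play; if nothing matches, the input string is returned unchanged.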

Using a straight regex, to test if it’s the colon, I used this:

<?php

$string = '<content:encoded>this is some text</content:encoded>';
preg_match('/<content\:encoded>(.*)<\/content\:encoded>/',$string,$matches);

echo '<pre>';
print_r($matches);
echo '</pre>';

?>

However, that did not print the expected array with or without adding \:. Could someone point me in the right direction for the misunderstanding here?
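A quick way to rule the colon in or out is to run both spellings of the pattern against a single-line and a multi-line subject. The sketch below uses made-up sample strings (the multi-line one only imitates the CDATA layout of the sample XML); the colon is not a PCRE metacharacter, so `\:` and `:` should behave identically.

```php
<?php
$singleLine = '<content:encoded>this is some text</content:encoded>';
$multiLine  = "<content:encoded><![CDATA[\n    some text\n]]></content:encoded>";

$escaped   = '/<content\:encoded>(.*)<\/content\:encoded>/';
$unescaped = '/<content:encoded>(.*)<\/content:encoded>/';

// preg_match() returns 1 on a match and 0 on no match.
$results = [
    'single line, escaped colon'   => preg_match($escaped, $singleLine),
    'single line, unescaped colon' => preg_match($unescaped, $singleLine),
    'multi line, escaped colon'    => preg_match($escaped, $multiLine),
    'multi line, unescaped colon'  => preg_match($unescaped, $multiLine),
];

foreach ($results as $label => $matched) {
    echo $label, ': ', $matched, PHP_EOL;
}
```

Both patterns match the single-line string and both fail on the multi-line one, which points at the newlines inside the node rather than the colon.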

Many thanks!

UPDATE:
Here’s a sample snippet of the real xml that fails, as indicated by @Florent.

http://pastebin.com/7z0f3MJP

UPDATE:
This regex matches the required content:

preg_match('/<content\:encoded>(.*)<\/content\:encoded>/msU',$string,$matches);

The m and s and U modifiers are explained better here:
http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

I neglected to consider these modifiers.

The results are now brought back by this regex, including the original problem, so this can now be resolved.


1 Answer

Editorial Team · Answer 1 · 2026-06-07T16:38:45+00:00


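You should add the following flags to your regex:

s so that . also matches newlines (this is the one that matters here, because the CDATA content spans several lines)
m so that ^ and $ match at line boundaries (only relevant if the pattern uses those anchors)
u to treat the pattern and subject as UTF-8 strings (if necessary)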
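The effect of each flag can be checked with a small experiment. The following is a sketch on a made-up two-node string, not the real feed; it also includes the U (ungreedy) modifier from the question's update, since greedy matching across several nodes is a related pitfall.

```php
<?php
// Two description nodes, each containing a newline.
$xml = "<description>first\nnode</description>\n<description>second\nnode</description>";

// m alone: "." still stops at newlines, so nothing matches.
$withM = preg_match('/<description>(.*)<\/description>/m', $xml, $m1);

// s: "." matches newlines, but the greedy ".*" runs to the LAST closing tag.
$withS = preg_match('/<description>(.*)<\/description>/s', $xml, $m2);

// s + U: ungreedy, so the match stops at the FIRST closing tag.
$withSU = preg_match('/<description>(.*)<\/description>/sU', $xml, $m3);

var_dump($withM);  // match count with m only
var_dump($m2[1]);  // captured text with s
var_dump($m3[1]);  // captured text with s and U
```

With s alone the capture swallows both nodes and the markup between them; adding U (or writing `.*?` instead) keeps each node separate, which is what preg_replace_callback() needs in order to index the nodes one by one.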
