The article bodies are segmented in one of two ways:
1. < p > the first paragraph < / p > < p > the second paragraph < / p >...
2. < p > the first paragraph < br / > < br / > the second paragraph < br / > < br / > the third paragraph < / p >
I wrote the following code:
$body_arr = preg_split('/\<\/?p\>/', $body, -1, PREG_SPLIT_NO_EMPTY);
echo count($body_arr);
if (count($body_arr) < 4)
{
    $body_arr = preg_split('/(\<br\/?\>)\s*\\1/', $body, -1, PREG_SPLIT_NO_EMPTY);
    $body1 = $body2 = $body3 = '';
    $total = count($body_arr);
    $maxed = max(floor($total / 2), 3);
    foreach ($body_arr as $k => $v)
    {
        if ($k == 0)
        {
            $body1 = $v . "<br><br>";
        }
        else if ($k < $maxed)
        {
            $body2 .= $v . "<br><br>";
        }
        else
        {
            $body3 .= $v . "<br><br>";
        }
    }
}
When the body is of the second kind, the result is wrong.
You can split the text with a single regex using nested groups. You’re starting with a p tag, followed by multiple paragraphs that end in either another close/open p tag, a pair of br tags, or a final close p tag.
The close/open p tag can be represented with the following:
The double br tag can be represented with the following:
And the close p tag can be represented with the following:
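As a rough sketch of those three pieces in PHP: the variable names and the exact patterns below are assumptions for illustration, and they deliberately tolerate whitespace inside the tags as well as between them, since the examples in the question are written that way.

```php
<?php
// Hypothetical sub-patterns (names and exact forms are assumptions).
// "~" is used as the regex delimiter below, so "/" needs no escaping.
$closeOpenP = '<\s*/\s*p\s*>\s*<\s*p\s*>';          // </p> followed by <p>
$doubleBr   = '<\s*br\s*/?\s*>\s*<\s*br\s*/?\s*>';  // two <br> or <br/> in a row
$closeP     = '<\s*/\s*p\s*>';                      // a lone </p>

// Quick check of each piece on its own.
var_dump(preg_match('~' . $closeOpenP . '~i', '</p> <p>'));        // int(1)
var_dump(preg_match('~' . $doubleBr . '~i', '<br/><br />'));       // int(1)
var_dump(preg_match('~' . $closeP . '~i', '< / p >'));             // int(1)
var_dump(preg_match('~' . $doubleBr . '~i', '<br/>text<br/>'));    // int(0)
```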
Note that I’m allowing for whitespace between tags because you had it in your example; remove the \s* if it isn’t necessary. Stitch those pieces together using some nested groups and you end up with something like this:
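One way that stitching might look is sketched below. The function name split_paragraphs() and the exact pattern are assumptions, not a definitive implementation: an inner capturing group grabs a run of text that starts right after a tag, and an inner alternation lists the three possible terminators. It relies on paragraphs containing no tags of their own.

```php
<?php
// Sketch only: split_paragraphs() and these patterns are hypothetical.
function split_paragraphs(string $body): array
{
    $closeOpenP = '<\s*/\s*p\s*>\s*<\s*p\s*>';          // </p><p>
    $doubleBr   = '<\s*br\s*/?\s*>\s*<\s*br\s*/?\s*>';  // <br/><br/>
    $closeP     = '<\s*/\s*p\s*>';                      // </p>

    // Paragraph text: a run of non-"<" characters that begins right after
    // a ">" and is terminated by one of the three separators.
    $pattern = '~(?<=>)([^<]+)(?:' . $closeOpenP . '|' . $doubleBr . '|' . $closeP . ')~i';

    preg_match_all($pattern, $body, $matches);

    // Trim surrounding whitespace and drop paragraphs that end up empty.
    $paragraphs = array_map('trim', $matches[1]);
    return array_values(array_filter($paragraphs, 'strlen'));
}

$case1 = '<p>the first paragraph</p><p>the second paragraph</p>';
$case2 = '<p>the first paragraph<br/><br/>the second paragraph<br/><br/>the third paragraph</p>';

print_r(split_paragraphs($case1)); // two paragraphs
print_r(split_paragraphs($case2)); // three paragraphs
```

Because both layouts come back as the same flat array, the three-way grouping from the question can then be done on that array without a separate branch per layout.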
I tested that with your examples and it works. From the examples I’m assuming that you don’t have tags in the middle of the paragraphs; if you do, you’ll need something fancier than "anything that is not the start of a tag" to capture the actual text.
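If paragraphs can contain inline tags, one alternative is to stop matching the text and split on the separators instead. The sketch below is a different technique (preg_split rather than a matching regex), and both the function name and the pattern are assumptions. It treats a bare p tag, a close/open p pair, or two consecutive br tags as a separator, and leaves everything else alone, including a single br.

```php
<?php
// Sketch only: split_on_separators() and this pattern are hypothetical.
// Note: it only recognises <p> without attributes.
function split_on_separators(string $body): array
{
    $separator = '~\s*(?:'
        . '<\s*/\s*p\s*>\s*<\s*p\s*>'     // </p><p>
        . '|(?:<\s*br\s*/?\s*>\s*){2}'    // two <br> or <br/> in a row
        . '|<\s*/?\s*p\s*>'               // a lone <p> or </p>
        . ')\s*~i';

    // preg_split() returns false on error; fall back to an empty list.
    $parts = preg_split($separator, $body) ?: [];

    // Trim each piece and drop the empty ones left at the edges.
    return array_values(array_filter(array_map('trim', $parts), 'strlen'));
}

print_r(split_on_separators('<p>one <b>bold</b> word<br/><br/>two<br/>still two</p>'));
```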