I know it’s better to use DOM for this purpose but let’s try to

Asked: May 19, 2026 (2026-05-19T22:12:58+00:00)

I know it’s better to use DOM for this purpose but let’s try to extract the text in this way:

<?php


$html=<<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;


        preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);

        if (empty($matches))
            exit;

        $matched_body_start_tag = $matches[0][0];
        $index_of_body_start_tag = $matches[0][1];

        $index_of_body_end_tag = strpos($html, '</body>');


        $body = substr(
                        $html,
                        $index_of_body_start_tag + strlen($matched_body_start_tag),
                        $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
        );

echo $body;

The result can be seen here: http://ideone.com/vH2FZ

As you can see, I am getting more text than expected.

There is something I don’t understand. To get the correct length for the substr($string, $start, $length) function, I am using:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

I don’t see anything wrong with this formula.

Could somebody kindly suggest where the problem is?

Many thanks to you all.

EDIT:

Thank you very much to all of you. There was just a bug in my brain. After reading your answers, I now understand what the problem is. It should be either:

  $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));

Or:

  $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);

1 Answer

Editorial Team · Answer 1 · 2026-05-19T22:12:59+00:00

The problem is that your string has newlines, and . in the pattern does not match newline characters by default. You need to add the s modifier to make . match across multiple lines.

Here is my solution, I prefer it this way.

<?php

$html=<<<EOD
<html>
<head>
</head>
<body buu="grger"     ga="Gag">
<p>Some text</p>
</body>
</html>
EOD;

    // get anything between <body> and </body> where <body can="have_as many" attributes="as required">
    if (preg_match('/(?:<body[^>]*>)(.*)<\/body>/isU', $html, $matches)) {
        $body = $matches[1];
    }
    // outputting all matches for debugging purposes
    var_dump($matches);
?>
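
To see what the s modifier changes, here is a minimal sketch. The sample string and variable names are made up for illustration, and it uses a lazy .*? instead of the U modifier, which should behave the same way here:

```php
<?php
// Hypothetical sample input; explicit "\n" line endings keep the result predictable.
$html = "<body>\n<p>Some text</p>\n</body>";

// Without the s modifier, . does not match "\n", so the pattern cannot span lines.
$without = preg_match('/<body[^>]*>(.*?)<\/body>/i', $html, $m1);

// With the s modifier, . also matches newline characters.
$with = preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $m2);

var_dump($without);      // int(0)
var_dump($with);         // int(1)
var_dump(trim($m2[1]));  // string(16) "<p>Some text</p>"
```

Note that the captured group still contains the line breaks around the paragraph; trim() is only used here to make the output easier to read.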

Edit: I am updating my answer to provide you with better explanation why your code fails.

You have this string:

<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>

Everything seems to be fine with it, but you also have non-printable characters (line breaks) after every line except the last.
You have 55 printable characters and 6 line breaks; the positions below count each line break as 2 characters (\r\n).

When you reach this part of the code:

$index_of_body_end_tag = strpos($html, '</body>');

You get the correct position of </body> (starting at position 51) but this counts the new lines.

So when you reach this line of code:

$index_of_body_start_tag + strlen($matched_body_start_tag)

It is evaluated to 31 (line breaks included), and:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

It is evaluated to 51 - 25 + 6 = 32 (the number of characters substr will read), but you only have 16 printable characters of text between <body> and </body>, plus 4 non-printable characters (the line break after <body> and the line break before </body>). This is the problem: you have to group the calculation so that the addition happens before the subtraction, like so:

$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))

which evaluates to 51 - (25 + 6) = 51 - 31 = 20 (16 + 4).
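
The arithmetic above can be checked with a short sketch. It assumes Windows-style "\r\n" line endings, since that is what the positions 25 and 51 imply; with plain "\n" the positions would be smaller but the bug is the same. The variable names $content_start, $wrong_length and $right_length are only for illustration:

```php
<?php
// Assumption: "\r\n" line endings, so each line break counts as 2 characters.
$html = implode("\r\n", [
    '<html>', '<head>', '</head>', '<body>', '<p>Some text</p>', '</body>', '</html>',
]);

preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);
$tag   = $matches[0][0];            // "<body>"
$start = $matches[0][1];            // 25
$end   = strpos($html, '</body>');  // 51

$content_start = $start + strlen($tag);         // 25 + 6 = 31
$wrong_length  = $end - $start + strlen($tag);  // 51 - 25 + 6 = 32
$right_length  = $end - $content_start;         // 51 - 31 = 20

// The wrong length reads past </body>; the right one stops just before it.
$too_much = substr($html, $content_start, $wrong_length);
$body     = substr($html, $content_start, $right_length);

var_dump($wrong_length, $right_length);  // int(32) int(20)
var_dump($body);                         // string(20): line break, <p>Some text</p>, line break
```

Computing $content_start once and subtracting it from $end avoids the operator-precedence mistake, because the grouping is made explicit by the intermediate variable.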

🙂 Hope this helps you understand why grouping the calculation is important. (Sorry for misleading you about newlines; that only applies to the regex example I gave above.)
