The Archive Base

Editorial Team
Asked: May 19, 2026

I know it’s better to use DOM for this purpose but let’s try to extract the text in this way:

<?php


$html=<<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;


        preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);

        if (empty($matches))
            exit;

        $matched_body_start_tag = $matches[0][0];
        $index_of_body_start_tag = $matches[0][1];

        $index_of_body_end_tag = strpos($html, '</body>');


        $body = substr(
                        $html,
                        $index_of_body_start_tag + strlen($matched_body_start_tag),
                        $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
        );

echo $body;

The result can be seen here: http://ideone.com/vH2FZ

As you can see, I am getting more text than expected.

There is something I don’t understand. To get the correct length for the substr($string, $start, $length) function, I am using:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

I don’t see anything wrong with this formula.

Could somebody kindly suggest where the problem is?

Many thanks to you all.

EDIT:

Thank you very much to all of you. There is just a bug in my brain. After reading your answers, I now understand what the problem is. The length should be either:

  $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));

Or:

  $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);
1 Answer

  1. Editorial Team
    Added an answer on May 19, 2026 at 10:12 pm

    The problem is that your string contains newlines, and . in a pattern does not match a newline by default. You need to add the s modifier so that . also matches newlines and the match can span multiple lines.

    Here is my solution, I prefer it this way.

    <?php
    
    $html=<<<EOD
    <html>
    <head>
    </head>
    <body buu="grger"     ga="Gag">
    <p>Some text</p>
    </body>
    </html>
    EOD;
    
        // get anything between <body> and </body> where <body can="have_as many" attributes="as required">
        if (preg_match('/(?:<body[^>]*>)(.*)<\/body>/isU', $html, $matches)) {
            $body = $matches[1];
        }
        // output all matches for debugging purposes
        var_dump($matches);
    ?>
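
As a rough illustration of what the s modifier changes, here is a minimal sketch. The sample string and the variable names ($sample, $withoutS, $withS) are made up for this demonstration and are not part of the code above.

```php
<?php
// Hypothetical sample input with newlines inside <body>.
$sample = "<body class=\"x\">\n<p>Some text</p>\n</body>";

// Without the s modifier "." stops at a newline, so the pattern
// cannot reach </body> on a later line and the match fails.
$withoutS = preg_match('/<body[^>]*>(.*)<\/body>/iU', $sample, $m1);

// With the s modifier "." also matches newlines, so the capture
// group can span several lines.
$withS = preg_match('/<body[^>]*>(.*)<\/body>/isU', $sample, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
var_dump($m2[1]);    // the text between the tags, newlines included
```

The U modifier makes the quantifiers ungreedy, so (.*) stops at the first </body> it can reach rather than the last one.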
    

    Edit: I am updating my answer to give you a better explanation of why your code fails.

    You have this string:

    <html>
    <head>
    </head>
    <body>
    <p>Some text</p>
    </body>
    </html>
    

    Everything seems fine with it, but there are also non-printable characters (line breaks) at the end of each line.
    You have 55 printable characters and 6 line breaks. The positions below assume each line break is stored as two characters (\r\n); with a single \n per line break the positions are smaller, but the reasoning is the same.

    When you reach this part of the code:

    $index_of_body_end_tag = strpos($html, '</body>');
    

    You get the correct position of </body> (it starts at position 51), and that position counts the line-break characters as well.
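
To see how the line-break style shifts the position that strpos() reports, here is a small sketch. It rebuilds the sample document from an array (the names $lines, $lf and $crlf are assumptions for this illustration) once with \n and once with \r\n:

```php
<?php
// The seven lines of the sample document, without line breaks.
$lines = ['<html>', '<head>', '</head>', '<body>', '<p>Some text</p>', '</body>', '</html>'];

// Same content, two line-break conventions.
$lf   = implode("\n", $lines);   // one character per line break
$crlf = implode("\r\n", $lines); // two characters per line break

$posLf   = strpos($lf, '</body>');
$posCrlf = strpos($crlf, '</body>');

var_dump($posLf);        // int(46)
var_dump($posCrlf);      // int(51)
var_dump(strlen($lf));   // int(61)
var_dump(strlen($crlf)); // int(67)
```

Either way strpos() returns a byte offset that includes every line-break character before the tag, which is exactly what substr() expects.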

    So when you reach this line of code:

    $index_of_body_start_tag + strlen($matched_body_start_tag)
    

    It evaluates to 31 (line breaks included), and:

    $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
    

    It evaluates to 51 – 25 + 6 = 32 characters to read, but there are only 16 printable characters between <body> and </body>, plus 4 non-printable ones (the line break after <body> and the one before </body>). That is the problem: you have to group the calculation so that the addition happens before the subtraction, like so:

    $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))
    

    which evaluates to 51 – (25 + 6) = 51 – 31 = 20 (16 + 4).
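
The arithmetic can be checked with a short sketch. It assumes \r\n line breaks so that the offsets match the ones quoted here, and the shortened variable names ($tag, $start, $end, $wrongLength, $rightLength) are only for illustration:

```php
<?php
// Assumption: lines joined with "\r\n" to reproduce offsets 25 and 51.
$html = implode("\r\n", ['<html>', '<head>', '</head>', '<body>', '<p>Some text</p>', '</body>', '</html>']);

preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);
$tag   = $matches[0][0];           // "<body>"
$start = $matches[0][1];           // 25
$end   = strpos($html, '</body>'); // 51

// Ungrouped: the tag length is added back instead of subtracted.
$wrongLength = $end - $start + strlen($tag);   // 51 - 25 + 6 = 32

// Grouped: subtract the offset where the body content begins.
$rightLength = $end - ($start + strlen($tag)); // 51 - 31 = 20

$body = substr($html, $start + strlen($tag), $rightLength);
var_dump($body); // "\r\n<p>Some text</p>\r\n"
```

With the ungrouped length, substr() reads 12 characters too many and runs past </body>, which is the extra text seen in the question.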

    🙂 Hope this helps you understand why grouping the calculation matters. (Sorry for misleading you about newlines; that point only applies to the regex example I gave above.)

© 2021 The Archive Base. All Rights Reserved