I have just a PHP script for HTML parsing and it works on simple

Question


Asked: June 12, 2026 (2026-06-12T16:37:46+00:00)


I have a PHP script for HTML parsing that works on simple websites, but now I need to parse the cinema program from this website (http://www.cinemacity.cz/). I am using the file_get_contents function, which returns just 4 newline characters (\n), and I can't figure out why.
The website itself will be harder to parse with DOMDocument and XPath, because the program is shown in a pop-up window that doesn't seem to change the URL, but I will deal with that problem after I manage to retrieve the HTML of the site.

Here is the shortened version of my script:

<?php
      $url = "http://www.cinemacity.cz/";
      $content = file_get_contents($url);
      $dom = new DOMDocument;

      // The $dom object is never FALSE; check the return values instead
      if ($content === false || !$dom->loadHTML($content)) {
        echo "FAAAAIL\n";
      }

      $xpath = new DOMXPath($dom);

      $tags = $xpath->query("/html");

      foreach ($tags as $tag) {
        var_dump(trim($tag->nodeValue));
      }
?>

EDIT:

So, following the advice from WBAR (thank you), I looked for a way to change the headers sent by file_get_contents(), and this is the answer I found elsewhere. Now I am able to obtain the HTML of the site; hopefully I will manage to parse this mess 😀

<?php
    libxml_use_internal_errors(true);
    // Create a stream
    $opts = array(
      'http'=>array(
        'user_agent' => 'PHP libxml agent', //Wget 1.13.4
        'method'=>"GET",
        'header'=>"Accept-language: en\r\n" .
                  "Cookie: foo=bar\r\n"
      )
    );
    $context = stream_context_create($opts);

    // Open the file using the HTTP headers set above
    $content = file_get_contents('http://www.cinemacity.cz/', false, $context);

    $dom = new DOMDocument;

    // The $dom object is never FALSE; check the return values instead
    if ($content === false || !$dom->loadHTML($content)) {
        echo "FAAAAIL\n";
    }

    $xpath = new DOMXPath($dom);

    $tags = $xpath->query("/html");

    foreach ($tags as $tag) {
        var_dump(trim($tag->nodeValue));
    }
?>
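The script above calls libxml_use_internal_errors(true) but never looks at what was buffered. Here is a minimal sketch of how the collected parser messages could be inspected. The helper name loadHtmlCollectingErrors and the inline sample markup are made up for illustration and are not part of the original script; which messages libxml reports depends on the installed libxml2 version.

```php
<?php
// Hypothetical helper: parse an HTML string and return the DOM (or null)
// together with the messages libxml buffered while parsing.
function loadHtmlCollectingErrors(string $html): array
{
    // Buffer parser warnings instead of printing them; remember the old setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Guard against an empty string, which loadHTML() rejects.
    $ok = ($html !== '') && $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();                 // empty the buffer for the next call
    libxml_use_internal_errors($previous); // restore the caller's setting

    return [$ok ? $dom : null, $messages];
}

// Inline sample markup (an assumption for illustration, not the real page).
[$dom, $messages] = loadHtmlCollectingErrors(
    '<html><body><p>Program</p><foo></body></html>'
);

$xpath = new DOMXPath($dom);
echo $xpath->query('//p')->item(0)->nodeValue, "\n";

foreach ($messages as $message) {
    echo $message, "\n";
}
```

Printing the buffered messages while developing makes it easier to tell a genuinely broken page from markup that is merely sloppy.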


1 Answer

Editorial Team · Answer 1 · 2026-06-12T16:37:47+00:00


The problem is not in PHP but on the target host, which inspects the client's User-Agent header. Compare a default wget request:

wget http://www.cinemacity.cz/
2012-10-07 13:54:39 (1,44 MB/s) - saved `index.html.1' [234908]

with the same request when the User-Agent header is removed:

wget --user-agent="" http://www.cinemacity.cz/
2012-10-07 13:55:41 (262 KB/s) - saved `index.html.2' [4/4]

In the second case the server returned only 4 bytes.
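On the PHP side, the same situation could be detected and worked around roughly as follows. This is only a sketch: the helper names isBlankResponse and contextWithUserAgent, the agent string and the timeout are assumptions for illustration, and whether the server accepts any particular User-Agent value has not been verified here.

```php
<?php
// Hypothetical helper: true when the fetch failed or the body holds nothing
// but whitespace, such as the four "\n" bytes described above.
function isBlankResponse($body): bool
{
    return $body === false || trim($body) === '';
}

// Hypothetical helper: build a stream context that sends a User-Agent header.
function contextWithUserAgent(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; an arbitrary choice
        ],
    ]);
}

// Hypothetical usage (needs network access; the agent string is a placeholder):
// $body = file_get_contents('http://www.cinemacity.cz/', false,
//                           contextWithUserAgent('MyCinemaParser/1.0'));
// if (isBlankResponse($body)) {
//     echo "Server returned no usable HTML\n";
// }

var_dump(isBlankResponse("\n\n\n\n"));      // bool(true)
var_dump(isBlankResponse('<html></html>')); // bool(false)
```

Checking for a whitespace-only body before handing it to DOMDocument makes this kind of server-side filtering visible straight away, instead of surfacing later as an empty XPath result.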
