My problem is that the following script works with some IRI’s and with others

Question

Asked: June 8, 2026 (2026-06-08T03:36:44+00:00)

The following script works with some IRIs but not with others. Why does it behave this way, and how can I fix it?
I suspect a charset problem, but that is only a guess, because an IRI with the same kind of characters works on Wikipedia.

<?php
include('C:\xampp\htdocs\php\simple_html_dom.php');
$html = file_get_html('http://de.wikisource.org/wiki/Am_B%C3%A4chle');
//Titel
foreach($html->find('span#ws-title') as $f)
echo $f->plaintext;

//1   http://de.wikisource.org/wiki/7._August_1929           OK
//2   http://de.wikisource.org/wiki/%E2%80%99s_ist_Krieg!    -
//3   http://de.wikisource.org/wiki/Am_B%C3%A4chle           -
//4   http://de.wikipedia.org/wiki/Guillaume-Aff%C3%A4re     OK
//5   http://de.wikisource.org/wiki/Solidit%C3%A4t           -
?>

These five IRIs are my examples. The last three contain %C3%A4, which is an “ä”, but only the one from Wikipedia works. The second IRI contains %E2%80%99, which is a “’”, and it does not work either.

The first IRI, from Wikisource, does work, as does every Wikisource IRI that contains no ä, ö, …

When it does not work I get the following warning:

Warning: file_get_contents(http://de.wikisource.org/wiki/Solidit%C3%A4t): failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in C:\xampp\htdocs\php\simple_html_dom.php on line 70

Fatal error: Call to a member function find() on a non-object in C:\xampp\htdocs\php\frage.php on line 5

The function that contains line 70 in simple_html_dom.php looks like this:

//65    function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
//66    {
//67    // We DO force the tags to be terminated.
//68    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $defaultBRText);
//69    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
//70    $contents = file_get_contents($url, $use_include_path, $context, $offset);
//71    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
//72    //    $contents = retrieve_url_contents($url);
//73    if (empty($contents))
//74    {
//75        return false;
//76    }
//77    // The second parameter can force the selectors to all be lowercase.
//78    $dom->load($contents, $lowercase, $stripRN);
//79    return $dom;
//80    }

Is there any way to get the script working for every IRI in Wikipedia or Wikisource? (I know that there is not always a span#ws-title, that’s not my problem.)

1 Answer

Editorial Team · Answer 1 · 2026-06-08T03:36:46+00:00

Awesome question! 🙂

They seem to filter by user agent. Try something like:

<?php
ini_set("user_agent", "Descriptive user agent string");
file_get_contents("http://de.wikisource.org/wiki/".urlencode("Am_Bächle"));
?>

You can probably skip the urlencode part; I only used it to test whether the encoding was right.
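As an alternative to the global ini_set, the user agent can be passed per request through the $context parameter that file_get_html already accepts (see the signature quoted in the question). The following is only a minimal sketch under assumptions: the helper names wiki_page_url and wiki_context are made up for illustration, the page title is assumed to be UTF-8, and the "\u{00E4}" escape needs PHP 7 or later.

```php
<?php
// Hypothetical helper: builds a page URL from a host and a raw UTF-8 page title.
function wiki_page_url(string $host, string $title): string
{
    // rawurlencode() percent-encodes the UTF-8 bytes, so "ä" becomes %C3%A4.
    return 'http://' . $host . '/wiki/' . rawurlencode($title);
}

// Hypothetical helper: creates a stream context that sends a User-Agent header.
function wiki_context(string $userAgent)
{
    return stream_context_create([
        'http' => ['user_agent' => $userAgent],
    ]);
}

$url = wiki_page_url('de.wikisource.org', "Am_B\u{00E4}chle");
$context = wiki_context('Descriptive user agent string');

// With simple_html_dom.php included, the fetch might then look like this:
// $html = file_get_html($url, false, $context);
// if ($html === false) { /* request failed, e.g. 403 Forbidden */ }
```

Checking the return value for false before calling find() should also avoid the fatal error from the question, since file_get_html returns false when the fetched content is empty.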

Please note that Wikisource apparently discourages automated parsing of its web pages. There may be an API available for wiki bots and the like; ask them or search the community pages. An API will be much easier to handle anyway.
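Wikisource runs on MediaWiki, which exposes a web API through api.php. As a rough sketch of what building such a request could look like: the helper name wiki_api_url is hypothetical, and the /w/api.php path and the https scheme are assumptions that should be checked against the site's own API documentation.

```php
<?php
// Hypothetical helper: builds a MediaWiki API query URL for a single page title.
function wiki_api_url(string $host, string $title): string
{
    $params = [
        'action' => 'query',   // generic page query
        'titles' => $title,    // raw UTF-8 title, encoded below
        'format' => 'json',    // ask for a JSON response
    ];
    // PHP_QUERY_RFC3986 percent-encodes values the same way rawurlencode() does.
    $query = http_build_query($params, '', '&', PHP_QUERY_RFC3986);
    return 'https://' . $host . '/w/api.php?' . $query;
}

$apiUrl = wiki_api_url('de.wikisource.org', "Am_B\u{00E4}chle");
```

The resulting URL can be fetched with file_get_contents (again with a descriptive user agent) and decoded with json_decode, which avoids scraping the HTML for a span#ws-title element.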
