The following script works with some IRIs but not with others. Why does it behave this way, and how can I fix it?
I suspect a charset problem, but that is only a guess, because the same kind of IRI works on Wikipedia.
<?php
include('C:\xampp\htdocs\php\simple_html_dom.php');
$html = file_get_html('http://de.wikisource.org/wiki/Am_B%C3%A4chle');
//Titel
foreach($html->find('span#ws-title') as $f)
echo $f->plaintext;
//1 http://de.wikisource.org/wiki/7._August_1929 OK
//2 http://de.wikisource.org/wiki/%E2%80%99s_ist_Krieg! -
//3 http://de.wikisource.org/wiki/Am_B%C3%A4chle -
//4 http://de.wikipedia.org/wiki/Guillaume-Aff%C3%A4re OK
//5 http://de.wikisource.org/wiki/Solidit%C3%A4t -
?>
The five IRIs in the comments are my examples. The last three contain %C3%A4, which is an “ä”, but only the one from Wikipedia works. The second IRI contains %E2%80%99, which is a “’”, and it does not work either.
The first IRI from Wikisource does work, however, as does every Wikisource IRI that contains no characters such as ä, ö, …
When it does not work, I get the following warning and error:
Warning: file_get_contents(http://de.wikisource.org/wiki/Solidit%C3%A4t): failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in C:\xampp\htdocs\php\simple_html_dom.php on line 70
Fatal error: Call to a member function find() on a non-object in C:\xampp\htdocs\php\frage.php on line 5
The function that contains line 70 in simple_html_dom.php looks like this:
//65 function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
//66 {
//67 // We DO force the tags to be terminated.
//68 $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $defaultBRText);
//69 // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
//70 $contents = file_get_contents($url, $use_include_path, $context, $offset);
//71 // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
//72 // $contents = retrieve_url_contents($url);
//73 if (empty($contents))
//74 {
//75 return false;
//76 }
//77 // The second parameter can force the selectors to all be lowercase.
//78 $dom->load($contents, $lowercase, $stripRN);
//79 return $dom;
//80 }
Is there any way to get the script working for every IRI in Wikipedia or Wikisource? (I know that there is not always a span#ws-title, that’s not my problem.)
Awesome question! 🙂
They seem to filter by user agent, so try sending a User-Agent header with the request.
You can probably skip the urlencode part of my test, since I only used it to check whether the encoding was right.
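A minimal sketch of that idea, assuming the `$context` parameter of `file_get_html` shown in the question is handed on to `file_get_contents` as in line 70. The helper name `build_context`, the user agent string and the contact address are placeholders I made up for illustration; nothing here is prescribed by Wikisource.

```php
<?php
// Hypothetical helper: builds a stream context that sends a User-Agent header.
function build_context($userAgent)
{
    return stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'user_agent' => $userAgent, // sent as the User-Agent request header
            'timeout'    => 10,         // seconds; an arbitrary example value
        ),
    ));
}

// Placeholder user agent; use something that identifies your script.
$context = build_context('MyTitleFetcher/0.1 (contact: you@example.org)');

// Optional encoding check: "\xC3\xA4" is the UTF-8 byte sequence for "ä".
$title = "Am_B\xC3\xA4chle";
$url   = 'http://de.wikisource.org/wiki/' . rawurlencode($title);

// Only fetch if the library is actually present next to this script.
$library = __DIR__ . '/simple_html_dom.php';
if (is_file($library)) {
    include $library;
    // Third argument is the stream context, as in the signature from the question.
    $html = file_get_html($url, false, $context);
    if ($html !== false) {
        foreach ($html->find('span#ws-title') as $f) {
            echo $f->plaintext, "\n";
        }
    } else {
        echo "Request failed for $url\n";
    }
}
```

Checking the return value against `false` before calling `find()` also avoids the fatal error from the question when a request is rejected.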
Please note that Wikisource apparently dislikes automated parsing of its web pages. There may well be an API available for wiki bots and the like, though; ask them or search the community pages. An API will be much easier to handle anyway.
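As a rough illustration of the API route: MediaWiki sites expose an `api.php` endpoint, and the sketch below only builds a request URL for it. The `/w/api.php` path and the `action=parse` parameters are my assumptions about how de.wikisource.org is set up, so check the site's API documentation before relying on them. The helper name `build_api_url` is hypothetical.

```php
<?php
// Hypothetical helper: builds a MediaWiki API URL that asks for a page's rendered text.
// The endpoint path and parameter choice are assumptions, not taken from the question.
function build_api_url($host, $pageTitle)
{
    $params = array(
        'action' => 'parse',   // ask the API to parse a page
        'page'   => $pageTitle,
        'prop'   => 'text',    // return the rendered HTML of the page
        'format' => 'json',
    );
    // http_build_query percent-encodes non-ASCII bytes in the title.
    return 'https://' . $host . '/w/api.php?' . http_build_query($params, '', '&');
}

// "\xC3\xA4" is the UTF-8 byte sequence for "ä".
$apiUrl = build_api_url('de.wikisource.org', "Am_B\xC3\xA4chle");
echo $apiUrl, "\n";
```

The resulting URL can be fetched with `file_get_contents` and decoded with `json_decode`; a descriptive User-Agent is probably still needed for such requests.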