I am running an Apache/PHP/MySQL server (XAMPP) on my local machine under Windows 7. There I have installed the MediaWiki software, together with many extensions. My aim is to download some pages from Wikipedia and show them locally. Everything runs fine, except for one big problem:
The image files in the German Wikipedia contain German umlauts (ä, ö, ü) in their file names. This cannot be changed, because the articles link to the names with the umlauts.
When I try to import these images (via the maintenance/importImages.php script), it does not work. I traced the code and figured out why:
When PHP scans the directory for files, it reads the file names as ANSI strings. MediaWiki internally requires that all strings are UTF-8. So the single ANSI byte for the umlaut is interpreted as the start of a multi-byte UTF-8 sequence that is never completed, which breaks the script.
If I manually add a call to utf8_encode() to the script, the name is fine and is correctly added to the database. But the file actually written to the “images” directory has a broken name: two special characters instead of the umlaut. The reason is that the PHP script sends UTF-8 strings to the filesystem functions (“copy”, …), but the operating system expects ANSI strings there. I could manually add a call to utf8_decode() before each filesystem call, but there are thousands of them!
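To make the byte-level mismatch concrete, here is a minimal sketch. It assumes the ANSI code page is Windows-1252 and that the mbstring extension is loaded; the file name Bär.jpg is only an illustration.

```php
<?php
// Assumption: the Windows ANSI code page is Windows-1252.
// "Bär.jpg" as the directory scan returns it: the umlaut is the single byte 0xE4.
$ansiName = "B\xE4r.jpg";

// 0xE4 followed by "r" is not a valid UTF-8 sequence, so UTF-8-only code chokes on it.
var_dump(mb_check_encoding($ansiName, 'UTF-8'));   // bool(false)

// Converting to UTF-8 turns 0xE4 into the two bytes 0xC3 0xA4
// (utf8_encode() gives the same result for this character).
$utf8Name = mb_convert_encoding($ansiName, 'UTF-8', 'Windows-1252');
echo bin2hex($utf8Name), "\n";                     // 42c3a4722e6a7067

// If the OS reads those two bytes as Windows-1252 again, each byte becomes
// its own character, which is the broken name seen in the images directory.
$misread = mb_convert_encoding($utf8Name, 'UTF-8', 'Windows-1252');
echo $misread, "\n";                               // BÃ¤r.jpg
```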
In short: the OS works in ANSI (this cannot easily be changed on Windows) and the MediaWiki software inside the PHP server works in UTF-8 (this also cannot be changed). Is there a way to automatically encode/decode file name strings every time they go into or out of the PHP server?
I have already experimented with mb_internal_encoding() and mb_http_output(), but this did not change anything: MediaWiki uses hard-coded functions which only work on UTF-8 strings.
You need to rename all the files on the filesystem before you import them, so that they match the names stored in the database.
The goal is that when the UTF-8 encoded byte sequence of a file name reaches the filesystem functions, the file is found.
So you need to rename each file from its current name to the name the filesystem sees when it receives those UTF-8 bytes.
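A minimal sketch of such a rename pass in PHP follows. The function renameToUtf8Bytes() is a hypothetical helper, not part of MediaWiki. It assumes the ANSI code page is Windows-1252, that the mbstring extension is loaded, and that PHP's filesystem functions pass names to the OS byte for byte in that code page, as described in the question. As far as I know, PHP 7.1 and later on Windows already treat paths as UTF-8, in which case this pass should not be needed. Try it on a copy of the directory first.

```php
<?php
// Hypothetical helper: renames every file whose name is not valid UTF-8
// so that its name bytes become the UTF-8 encoding of the original name.
// Assumption: names returned by scandir() are in $codePage (Windows-1252 here).
function renameToUtf8Bytes(string $dir, string $codePage = 'Windows-1252'): array
{
    $renamed = [];
    foreach (scandir($dir) as $name) {
        $path = $dir . DIRECTORY_SEPARATOR . $name;
        // Skip ".", ".." and subdirectories.
        if (!is_file($path)) {
            continue;
        }
        // Pure ASCII names and names that are already valid UTF-8 stay as they are,
        // so the script can be run more than once.
        if (mb_check_encoding($name, 'UTF-8')) {
            continue;
        }
        $utf8Name = mb_convert_encoding($name, 'UTF-8', $codePage);
        if (rename($path, $dir . DIRECTORY_SEPARATOR . $utf8Name)) {
            $renamed[$name] = $utf8Name;   // old name => new name
        }
    }
    return $renamed;
}
```

Calling renameToUtf8Bytes('C:\\import\\images') (the path is a placeholder) returns a map of old to new names. In Windows Explorer the renamed files will look broken (for example BÃ¤r.jpg), but those are exactly the bytes that the UTF-8 strings from PHP will ask for.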
For your web server, you might also need to add a rewrite rule to take care of the incoming HTTP requests, as the web server might handle file names differently than PHP itself does.
Also check in your system configuration which code page is used for the filesystem. That can differ from system to system.