I’m working on a WebDAV implementation for PHP. To make Windows and other operating systems work together, I need to jump through some character-encoding hoops.
Windows uses ISO-8859-1 in its HTTP requests, while most other clients encode anything beyond ASCII as UTF-8.
My first approach was to ignore this altogether, but I quickly ran into issues when returning URLs. I then figured it’s probably best to normalize all URLs.
Take ü as an example. OS X sends this over the wire as:
u%CC%88 (the letter u followed by U+0308 COMBINING DIAERESIS, i.e. the decomposed NFD form)
Windows sends this as:
%FC (Latin-1)
But after running utf8_encode on %FC, I get:
%C3%BC (codepoint U+00FC, the precomposed NFC form)
Should I treat %C3%BC and u%CC%88 as the same thing? If so, how? Not touching it seems to work OK for Windows: it somehow understands that it’s a Unicode character, but updating the same file then throws an error for no apparent reason.
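For what it’s worth, the two byte sequences can be compared after Unicode normalization. A minimal sketch, assuming PHP’s intl extension (which provides the Normalizer class) is installed:

```php
<?php
// NFD form as sent by OS X: "u" followed by U+0308 COMBINING DIAERESIS.
$nfd = "u\xCC\x88";
// NFC form, i.e. what utf8_encode("\xFC") produces: U+00FC as one codepoint.
$nfc = "\xC3\xBC";

// The raw byte sequences differ...
var_dump($nfd === $nfc); // false
// ...but after normalizing both to NFC they compare equal.
var_dump(Normalizer::normalize($nfd, Normalizer::FORM_C) === $nfc); // true
```

Normalizing every incoming path to one form (NFC is the common choice on the web) would let you treat both spellings as the same resource.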
I’d be happy to provide more information.
I hate answering my own questions, but here goes.
I ended up not bothering. I did extensive research on how various operating systems encode and handle paths. It turns out that in most cases the other OSes handle paths in different normalization forms just fine. Windows behaved a bit poorly, but it works.
Whenever I receive a path that’s not valid UTF-8 at all, I try to detect the encoding and convert it to UTF-8.