I have a PHP script (running on a Linux server) that outputs the names of some files on the server. It outputs these file names in a simple text-only format.
This output is read from a VB.NET program by using HttpWebRequest, HttpWebResponse, and a StreamReader.
The problem is that some of the file names being output contain… unusual characters. Specifically, the ‘section’ symbol (§).
If I view the output of the PHP script in a web browser, the symbol appears fine.
But when I read the output of the PHP script into my .NET program, the symbol doesn’t appear correctly (it appears as a generic ‘block’ symbol).
I’ve tried all the different character encoding options that you can use when reading the response stream (from the HttpWebResponse). I’ve tried outputting the stream directly to a text file (no good), displaying it in a TextBox (no good), and even when viewing the results directly in the Visual Studio debugger, the character appears as a block instead of as the ‘section’ symbol.
I’ve examined the output in a hex editor, as suggested by a related question, ‘How do you troubleshoot character encoding problems?’
When I write out the section symbol (§) from .NET itself, the hex bytes I see representing it are ‘c2 a7’ (makes sense if it’s unicode, right? requires two bytes?). When I write out the output from the PHP script directly to a file and examine that with a hex editor, the symbol shows up as ‘ef bf bd’ – three bytes instead of two?
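For reference, the two byte sequences above can be reproduced outside of .NET and PHP. The following is a minimal sketch in Python, used purely for illustration and not part of the original setup; the helper name `hex_bytes` is made up for this example. It shows that ‘c2 a7’ is the UTF-8 encoding of U+00A7, and that ‘ef bf bd’ is the UTF-8 encoding of U+FFFD, the replacement character a decoder substitutes when it meets a byte sequence that is not valid UTF-8 (such as the lone ISO-8859-1 byte ‘a7’).

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes the way a hex editor shows them, e.g. 'c2 a7'."""
    return " ".join(f"{b:02x}" for b in data)


section = "\u00a7"  # the section symbol

# The same character encoded two different ways.
utf8_bytes = section.encode("utf-8")         # two bytes
latin1_bytes = section.encode("iso-8859-1")  # one byte

# A lone 0xA7 byte is not valid UTF-8. With errors="replace" the decoder
# substitutes U+FFFD (the replacement character) instead of raising.
misread = latin1_bytes.decode("utf-8", errors="replace")

# Writing that replacement character back out as UTF-8 gives three bytes.
rewritten = misread.encode("utf-8")

print(hex_bytes(utf8_bytes))    # c2 a7
print(hex_bytes(latin1_bytes))  # a7
print(hex_bytes(rewritten))     # ef bf bd
```

So three bytes in the saved file do not mean the symbol was encoded with three bytes; they suggest the original byte was already lost when the text was decoded.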
I’m at a loss as to what to do – if I need to specify some other character encoding, or if I’m missing something obvious about this.
Here’s the code that’s used to get the output of the PHP script (VB-style comments modified so they appear correctly on this site):
Dim myRequest As HttpWebRequest = WebRequest.Create("http://www.example.com/sample.php")
Dim myResponse As HttpWebResponse = myRequest.GetResponse()
// read the response stream
Dim myReader As New StreamReader(myResponse.GetResponseStream())
// read the entire output in one block (just as an example)
Dim theOutput As String = myReader.ReadToEnd()
Any ideas?
- Am I using the wrong kind of StreamReader? (I’ve tried passing the character encoding in the call to create the new StreamReader – I’ve tried all the ones that are in System.Text.Encoding – UTF-8, UTF-7, ASCII, UTF-32, Unicode, etc.)
- Should I be using a different method for reading the output of the PHP script?
- Is there something I should be doing different on the PHP-side when outputting the text?
UPDATED INFO:
- The output from PHP is specifically encoded UTF-8 by calling:
utf8_encode($file);
- When I wrote out the symbol from .NET, I copied and pasted the symbol from the Character Map app in Windows. I also copied & pasted it directly from the file’s name (in Windows) and from this web page itself – all gave the same hex value when written out (c2 a7).
- Yes, the ‘section symbol’ I’m talking about is U+00A7 (ALT+0167 on Windows, according to Character Map).
- The content-type is set explicitly via
header('Content-Type: text/html; charset=utf-8');
right at the beginning of the PHP script.
UPDATE:
Figured it out myself, but I couldn’t have done it without the help from the people who answered. Thank you!
Figured it out!!
Like so many things, it’s simple in retrospect!
Jon Skeet was correct – it was meant to be UTF-8, but definitely wasn’t.
Turns out, in the original script I was using (before I stripped it down to make it simpler to debug), there was some additional text output by the script which was not wrapped in a utf8_encode() call. This caused the entire page to be output in ISO-8859-1 instead of UTF-8.
I noticed this when I checked each script’s ‘encoding’ property (in Firefox, ‘View Page Info’). It was UTF-8 for the testing script, but ISO-8859-1 for the production script. The production script also printed the date of the file; this was not wrapped in a call to utf8_encode, and that caused the entire output to change to ISO-8859-1.
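A quick way to catch this kind of mismatch on the reading side is to decode the raw response bytes strictly as UTF-8 and only fall back to ISO-8859-1 if that fails. The sketch below is in Python for illustration only (it is not the VB.NET code from the question), and the function name `decode_response` and the sample file name are made up for the example.

```python
from typing import Tuple


def decode_response(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for a raw response body.

    Tries strict UTF-8 first; if the bytes are not valid UTF-8,
    falls back to ISO-8859-1.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte value maps to a character in ISO-8859-1,
        # so this fallback cannot raise.
        return raw.decode("iso-8859-1"), "iso-8859-1"


# Hypothetical file name containing the section symbol (U+00A7).
name = "report \u00a7 1.txt"

as_utf8 = name.encode("utf-8")          # what the server was meant to send
as_latin1 = name.encode("iso-8859-1")   # what it would send without conversion

print(decode_response(as_utf8))
print(decode_response(as_latin1))
```

This is only a heuristic: some ISO-8859-1 byte sequences also happen to be valid UTF-8 and would be misdetected, so fixing the output on the PHP side remains the more reliable option.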
[Insert sound of me slapping my forehead here]
Thanks to everyone who answered! You were very helpful!