Question

Asked: May 10, 2026

I have a PHP script (running on a Linux server) that outputs the names of some files on the server. It outputs these file names in a simple text-only format.

This output is read from a VB.NET program by using HttpWebRequest, HttpWebResponse, and a StreamReader.

The problem is that some of the file names being output contain… unusual characters. Specifically, the ‘section’ symbol (§).

If I view the output of the PHP script in a web browser, the symbol appears fine.

But when I read the output of the PHP script into my .NET program, the symbol doesn’t appear correctly (it appears as a generic ‘block’ symbol).

I’ve tried all the different character encoding options that you can use when reading the response stream (from the HttpWebResponse). I’ve tried outputting the stream directly to a text file (no good), displaying it in a TextBox (no good), and even when viewing the results directly in the Visual Studio debugger, the character appears as a block instead of as the ‘section’ symbol.

I’ve examined the output in a hex editor, as suggested by a related question, ‘How do you troubleshoot character encoding problems?’

When I write out the section symbol (§) from .NET itself, the hex bytes I see representing it are ‘c2 a7’ (makes sense if it’s unicode, right? requires two bytes?). When I write out the output from the PHP script directly to a file and examine that with a hex editor, the symbol shows up as ‘ef bf bd’ – three bytes instead of two?
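The two byte patterns above can be reproduced outside .NET and PHP. Here is a minimal Python sketch (Python is used only for illustration and is not part of the original setup). It shows that c2 a7 is the UTF-8 encoding of U+00A7, and that ef bf bd is the UTF-8 encoding of U+FFFD, the replacement character a decoder substitutes for bytes it cannot interpret. The scenario where the server actually sends the single ISO-8859-1 byte a7 is an assumption, chosen to show one way ef bf bd can arise.

```python
# The section symbol, U+00A7.
section = "\u00a7"

# Encoded as UTF-8 it takes two bytes: c2 a7.
utf8_bytes = section.encode("utf-8")

# Encoded as ISO-8859-1 it is the single byte a7.
latin1_bytes = section.encode("iso-8859-1")

# ASSUMPTION: the server sent the ISO-8859-1 byte, but the client decoded
# it as UTF-8. A lone a7 is not a valid UTF-8 sequence, so a lenient
# decoder replaces it with U+FFFD (the "block" or "question mark" glyph).
decoded = latin1_bytes.decode("utf-8", errors="replace")

# Writing that replacement character back out as UTF-8 gives ef bf bd.
rewritten = decoded.encode("utf-8")

print(utf8_bytes.hex(), latin1_bytes.hex(), rewritten.hex())
```

Seeing ef bf bd in a file therefore means the original byte was already lost during decoding; no choice of encoding when writing the file can bring it back.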

I’m at a loss as to what to do – if I need to specify some other character encoding, or if I’m missing something obvious about this.

Here’s the code that’s used to get the output of the PHP script (VB-style comments modified so they appear correctly on this site):

    Dim myRequest As HttpWebRequest = WebRequest.Create("http://www.example.com/sample.php")
    Dim myResponse As HttpWebResponse = myRequest.GetResponse()

    // read the response stream
    Dim myReader As New StreamReader(myResponse.GetResponseStream())

    // read the entire output in one block (just as an example)
    Dim theOutput As String = myReader.ReadToEnd()

Any ideas?

Am I using the wrong kind of StreamReader? (I’ve tried passing the character encoding in the call to create the new StreamReader – I’ve tried all the ones that are in System.Text.Encoding – UTF-8, UTF-7, ASCII, UTF-32, Unicode, etc.)
Should I be using a different method for reading the output of the PHP script?
Is there something I should be doing differently on the PHP side when outputting the text?

UPDATED INFO:

The output from PHP is specifically encoded UTF-8 by calling: utf8_encode($file);
When I wrote out the symbol from .NET, I copied and pasted the symbol from the Character Map app in Windows. I also copied & pasted it directly from the file’s name (in Windows) and from this web page itself – all gave the same hex value when written out (c2 a7).
Yes, the ‘section symbol’ I’m talking about is U+00A7 (ALT+0167 on Windows, according to Character Map).
The content-type is set explicitly via header('Content-Type: text/html; charset=utf-8'); right at the beginning of the PHP script.
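One way to test whether the declared charset and the actual bytes agree is to decode strictly, so that a mismatch raises an error instead of silently turning into a replacement character. The following is a rough Python sketch of that check, not the code used in the question. The helper names `charset_from_content_type` and `decode_body` are made up for illustration, and the ISO-8859-1 fallback is an assumption.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Extract the charset parameter from a Content-Type header value.

    The fallback default is an assumption made for this sketch.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def decode_body(body, content_type):
    """Decode raw response bytes with the declared charset.

    errors="strict" makes a charset/byte mismatch raise UnicodeDecodeError
    rather than producing U+FFFD.
    """
    return body.decode(charset_from_content_type(content_type), errors="strict")


header = "text/html; charset=utf-8"

# Bytes that really are UTF-8 decode cleanly.
good = decode_body(b"\xc2\xa7", header)

# A raw ISO-8859-1 byte under a UTF-8 label fails loudly.
try:
    decode_body(b"\xa7", header)
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True
```

If the strict decode fails, the header and the body disagree, and the fix belongs on the side producing the output rather than in the reader.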

UPDATE:

Figured it out myself, but I couldn’t have done it without the help from the people who answered. Thank you!


1 Answer

score 0 · Answer 1 · 2026-05-10T22:23:38+00:00

Figured it out!!

Like so many things, it’s simple in retrospect!

Jon Skeet was correct – it was meant to be UTF-8, but definitely wasn’t.

Turns out, in the original script I was using (before I stripped it down to make it simpler to debug), there was some additional text output by the script which was not wrapped in a utf8_encode() call. This caused the entire page to be output in ISO-8859-1 instead of UTF-8.

I noticed this when I checked the ‘encoding’ property of both scripts’ output (in Firefox, ‘View Page Info’). It was UTF-8 for the testing script, but ISO-8859-1 for the production script. The production script also printed the date of the file; this was not wrapped in a call to utf8_encode() – and that caused the entire output to change to ISO-8859-1.
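A quick way to catch this kind of partially converted output is to scan the raw bytes for the first sequence that is not valid UTF-8. The Python sketch below is only an illustration and does not reproduce the production script. `php_utf8_encode` is a stand-in for what PHP's utf8_encode() does (ISO-8859-1 in, UTF-8 out). The sample strings are hypothetical, and the sketch assumes the unwrapped text contains a non-ASCII byte.

```python
def first_invalid_utf8_offset(data):
    """Return the offset of the first byte that is not valid UTF-8, or -1."""
    try:
        data.decode("utf-8")
        return -1
    except UnicodeDecodeError as err:
        return err.start


def php_utf8_encode(latin1_bytes):
    # Stand-in for PHP's utf8_encode(): treat input as ISO-8859-1, emit UTF-8.
    return latin1_bytes.decode("iso-8859-1").encode("utf-8")


# Hypothetical sample data, both stored as ISO-8859-1 bytes.
file_name = b"chapter \xa7 1.txt"
extra_text = b"note \xa7 2"  # ASSUMPTION: unwrapped output with a non-ASCII byte

# Every piece converted: the whole page is valid UTF-8.
all_wrapped = php_utf8_encode(file_name) + b"\n" + php_utf8_encode(extra_text)

# One piece skipped: the page is a mix of UTF-8 and ISO-8859-1 bytes.
mixed = php_utf8_encode(file_name) + b"\n" + extra_text

print(first_invalid_utf8_offset(all_wrapped))
print(first_invalid_utf8_offset(mixed))
```

The reported offset points at the first unconverted byte, which makes it easier to find the statement that skipped the conversion.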

[Insert sound of me slapping my forehead here]

Thanks to everyone who answered! You were very helpful!
