i just started dabbling in php and i’m afraid i need some help to


Asked: May 27, 2026 (2026-05-27T06:47:59+00:00)


I just started dabbling in PHP, and I'm afraid I need some help figuring out how to manipulate UTF-8 strings.

I'm working on Ubuntu 11.10 x86 with PHP version 5.3.6-13ubuntu3.2. I have a UTF-8 encoded file (vim's :set encoding confirms this), which I read using

$file = fopen("file.txt", "r");
// fgets() returns false at end of file, so test its result directly
while (($line = fgets($file)) !== false) {
    //...
}
fclose($file);

mb_detect_encoding($line) reports UTF-8.
If I echo $line, the line displays properly in the browser (no mangled characters), so I guess everything is fine with the browser and Apache. I nevertheless searched my Apache configuration for AddDefaultCharset and tried adding HTTP meta tags for the character encoding, just in case.

When I try to split the string using $arr = mb_split(';', $line), the fields of the resulting array contain mangled UTF-8 characters (mb_detect_encoding($arr[0]) reports UTF-8 as well).

So echo $arr[0] will result in something like this: ï»¿Î‘Î˜Î—ÎÎ.

I have tried setting mb_detect_order('utf-8') and mb_internal_encoding('utf-8'), but nothing changed. I also tried to detect UTF-8 manually using this W3 Perl regex, because I read somewhere that mb_detect_encoding can sometimes fail (myth?), but the results were the same.

So my question is: how can I properly split the string? Is going down the mb_ path the wrong way? What am I missing?

Thank you for your help!

UPDATE: I'm adding sample strings and their base64 equivalents (thanks to @chris for the suggestion).

1. original string: "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"
2. base64 encoded: "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5"
3. first part (the equivalent of "ΑΘΗΝΑ") base64 encoded before splitting: "zpHOmM6Xzp3OkQ=="
4. first part ($arr[0] after splitting): "ï»¿Î‘Î˜Î—ÎÎ‘"
5. first part after splitting base64 encoded: "77u/zpHOmM6Xzp3OkQ=="

OK, so after doing this there seems to be a 77u/ difference between 3 and 5, which according to this is a UTF-8 BOM (byte order mark). So how can I avoid it?
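The comparison can be reproduced with a short sketch. It assumes nothing beyond the two base64 strings from items 3 and 5 of the list above and the standard three-byte UTF-8 BOM (EF BB BF). The variable names are made up for illustration.

```php
<?php
// The UTF-8 byte order mark is the three bytes EF BB BF.
$bom = "\xEF\xBB\xBF";

// Base64 strings from items 3 and 5 of the list above.
$before = base64_decode("zpHOmM6Xzp3OkQ==");
$after  = base64_decode("77u/zpHOmM6Xzp3OkQ==");

// Three bytes encode to exactly four base64 characters.
$bomBase64 = base64_encode($bom);

// True when $after is simply $before with a BOM in front of it.
$onlyBomDiffers = ($after === $bom . $before);

echo $bomBase64, "\n";
echo $onlyBomDiffers ? "only the BOM differs\n" : "the strings differ elsewhere\n";
```

Because the BOM is exactly three bytes long, it maps to a clean four-character base64 prefix and leaves the rest of the encoded string unchanged.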

UPDATE 2: I woke up refreshed today and, with your tips in mind, tried again. It seems that $line = fgets($file) reads the first line correctly (no mangled characters) and fails for each subsequent line. I then base64-encoded the first and second lines, and the 77u/ BOM appeared in the base64 string of the first line only. Next I opened the offending file in vim and entered :set nobomb followed by :w to save the file without the BOM. Running the PHP script again showed that the first line was now mangled too. Based on @hakre's remove_utf8_bom, I added its complementary function:

function add_utf8_bom($str){
    $bom= "\xEF\xBB\xBF";
    return substr($str,0,3)===$bom?$str:$bom.$str;
}

And voilà, each line is displayed correctly now.
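For comparison, the opposite approach is to strip the BOM from the line and then split it. The sketch below is an assumption about what a function named remove_utf8_bom might do; it is not @hakre's actual code, which is not reproduced in this post. The sample line is the one from the first update, with a BOM prepended by hand.

```php
<?php
// Hypothetical sketch: strips a leading UTF-8 BOM (EF BB BF) if present.
function remove_utf8_bom($str) {
    $bom = "\xEF\xBB\xBF";
    // The cast guards against older PHP versions, where substr() can return false.
    return substr($str, 0, 3) === $bom ? (string) substr($str, 3) : $str;
}

// Sample line from the question, with a BOM prepended as on the first line of the file.
$line = "\xEF\xBB\xBF" . "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889\n";

// ';' is a single ASCII byte and never occurs inside a multi-byte UTF-8
// sequence, so the byte-wise explode() splits UTF-8 text safely here.
$fields = explode(';', remove_utf8_bom(rtrim($line, "\r\n")));

print_r($fields);
```

Only the first line of a file can carry a BOM, so it is enough to call such a function once, before the first split.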

I do not much like this solution, as it seems very hackish (I can't believe that an entire framework or language provides no way to deal with strings that lack a BOM). Do you know of an alternative approach? Otherwise I'll proceed with the above.

Thanks to @chris, @hakre and @jacob for their time!

UPDATE 3 (solution): It turns out that it was a browser issue after all. It was not enough to add header('Content-type: text/html; charset=UTF-8') and meta tags like <meta http-equiv="Content-type" value="text/html; charset=UTF-8" />. The output also had to be properly enclosed in an <html><body> section, or the browser would not interpret the encoding correctly. Thanks to @jake for his suggestion.

Moral of the story: I should learn more about HTML before trying to code for the browser in the first place. Thanks for your help and patience, everyone.


1 Answer

Editorial Team · Answer 1 · 2026-05-27T06:48:00+00:00

When you write debug or testing scripts in PHP, make sure they output a more or less valid HTML page.

I like to use a PHP file similar to the following:

<!DOCTYPE html>
<html>
  <head>
    <meta charset=utf-8>
    <title>Test page for project XY</title>
  </head>
  <body>
     <h1>Test Page</h1>
     <pre><?php
        echo print_r($_GET,1);
     ?></pre>
  </body>
</html>

If you don’t include any HTML tags, the browser might interpret the file as a text file and all kinds of weird things could happen. In your case, I assume the browser interpreted the file as a Latin1 encoded text file. I assume it worked with the BOM, because whenever the BOM was present, the browser recognized the file as a UTF-8 file.
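The same idea can be sketched as a small helper that wraps debug output in a complete UTF-8 HTML document and also declares the encoding in the HTTP header. The function name render_debug_page and its parameters are hypothetical, chosen only for illustration; the sample data is a shortened version of the line from the question.

```php
<?php
// Hypothetical helper: wraps debug output in a minimal UTF-8 HTML document.
function render_debug_page($title, $text) {
    // Escape both values so that the data cannot be read as markup.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $safeText  = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
         . "<html>\n<head>\n<meta charset=\"utf-8\">\n"
         . "<title>{$safeTitle}</title>\n</head>\n"
         . "<body>\n<pre>{$safeText}</pre>\n</body>\n</html>\n";
}

// Declare the encoding in the HTTP header too, unless output has already started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Shortened sample line, split on the ASCII delimiter.
$fields = explode(';', "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242");
$page = render_debug_page('Test page', print_r($fields, true));

echo $page;
```

With both the header and the meta tag in place, the browser no longer has to guess the encoding, whether or not the data starts with a BOM.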
