The Archive Base

Editorial Team
Asked: May 13, 2026 (2026-05-13T00:53:09+00:00)


I need to ensure that all my strings are UTF-8. Would it be better to check that input coming from a user is ASCII-like or that it is UTF-8-like?

//KohanaPHP
function is_ascii($str) {
    return ! preg_match('/[^\x00-\x7F]/S', $str);
}

//Wordpress
function seems_utf8($Str) {
    for ($i=0; $i<strlen($Str); $i++) {
        if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
        elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
        elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
        elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
        elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
        elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
        else return false; # Does not match any model
        for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
            if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
            return false;
        }
    }
    return true;
}

I did some benchmarking on 100 strings (half valid UTF-8/ASCII and half not) and found that seems_utf8() takes 0.011 seconds while is_ascii() takes only 0.001. But my gut is telling me that you get what you pay for and the UTF-8 check would be the better choice.

I'm then planning to convert the strings with something like this:

<?php

/* Example data */
$string[] = 'hello';
$string[] = 'asdfghjkl;qwertyuiop[]\zxcvbnm,./]12345657890-=+_)(*&^%$#@!';
$string[] = '';
$string[] = 'accentué';
$string[] = '»á½µÎ½Ï‰Î½ τὰ ';
$string[] = '???R??=8 ????? ++++¦??? ???2??????';
$string[] = 'hello¦ùó 5/5¡45-52ZÜ¿»'. "0x93". octdec('77'). decbin(26). "F???pp?? ??? ". '»á½µÎ½Ï‰Î½ τὰ ';


$time = microtime(true);

//Count the successes
$true = array(1 => 0, 0 => 0);

foreach($string as $s) {
    $r = seems_utf8($s);    //0.011
    $true[(int) $r]++;      // tally valid (1) and invalid (0) strings

    print_pre(mb_substr($s, 0, 30). ' is '. ($r ? 'UTF-8' : 'non-UTF-8'));


    if( ! $r ) {

        $e = mb_detect_encoding($s, "auto");

        print_pre('Encoding: '. $e);

        //Convert
        $s = iconv($e, 'UTF-8//TRANSLIT', $s);

        print_pre(mb_substr($s, 0, 30). ' is now '. (seems_utf8($s) ? 'valid' : 'not'). ' UTF-8');
    }

}

print_pre($true);
print_pre((microtime(TRUE) - $time). ' seconds');

function print_pre() { print '<pre>'; print_r(func_get_args()); print '</pre>'; }
1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 12:53 am

    I’m not sure how much of this approach is necessary. If you ask the user for UTF-8 input and they give you something else, throw it away and ask again.

    The various character set detection functions out there are universally (and tragically, necessarily) imperfect. The ones in the mbstring library, as well as the ones in iconv, aren’t even that advanced compared to some of the stuff that’s out there. mb_detect_encoding() basically iterates through a list of character sets and returns the first one that makes the string it has in hand look valid. In this day and age it’s probable that several would match (which is why the ordering is exposed through mb_detect_order()).
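To illustrate why detection is ambiguous, here is a minimal sketch that tests one byte string against several candidate encodings with mb_check_encoding(). The candidate list and the sample bytes are assumptions chosen for illustration; the point is only that more than one encoding can accept the same bytes.

```php
<?php
// "accentué" encoded as ISO-8859-1: the trailing byte 0xE9 is not valid UTF-8 on its own.
$bytes = "accentu\xE9";

// Illustrative candidate list (an assumption, not a recommended detection order).
$candidates = ['UTF-8', 'ISO-8859-1', 'ISO-8859-15', 'Windows-1252'];

// Collect every encoding under which the bytes form a valid string.
$plausible = [];
foreach ($candidates as $encoding) {
    if (mb_check_encoding($bytes, $encoding)) {
        $plausible[] = $encoding;
    }
}

print_r($plausible);
```

All three single-byte encodings accept the string, so a detector has to break the tie by list order or heuristics rather than by anything in the data itself.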

    Ensure your pages are served with the correct HTTP and HTML character set declarations, and browsers should send data back in the same encoding. To be extra specific, include the accept-charset attribute in your form tag. I’ve yet to discover a case where this was ignored that didn’t represent an attack.
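As a rough sketch of those declarations, the snippet below sends a UTF-8 Content-Type header and builds a page whose meta tag and form both name UTF-8. The function render_form(), the field name "comment", and the action URL are hypothetical names used only for this example.

```php
<?php
// HTTP-level declaration; it has to be sent before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

/**
 * Build a minimal page with an HTML-level charset declaration and a form
 * that asks the browser to submit UTF-8. render_form() is a hypothetical helper.
 */
function render_form(string $action): string
{
    // Escape the action URL before placing it in an attribute.
    $action = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');

    return '<!DOCTYPE html>' . "\n"
        . '<html><head><meta charset="UTF-8"><title>Form</title></head><body>' . "\n"
        . '<form method="post" action="' . $action . '" accept-charset="UTF-8">' . "\n"
        . '<input type="text" name="comment">' . "\n"
        . '<button type="submit">Send</button>' . "\n"
        . '</form></body></html>';
}
```

Calling `echo render_form('/submit');` would emit the page. The submitted data still needs to be validated on the server, since a client can ignore all of these hints.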

    To check the encoding of a byte stream, you can simply use mb_check_encoding().
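A minimal sketch of the "validate and reject" approach, assuming a hypothetical helper named accept_utf8(): it returns the input untouched when mb_check_encoding() accepts it as UTF-8 and null otherwise, leaving it to the caller to ask the user again instead of guessing an encoding and converting.

```php
<?php
/**
 * Return the input unchanged if it is valid UTF-8, or null so the caller
 * can reject it. accept_utf8() is a hypothetical helper name.
 */
function accept_utf8(string $input): ?string
{
    return mb_check_encoding($input, 'UTF-8') ? $input : null;
}

// Example inputs written as explicit bytes so the source file's own encoding does not matter.
$valid   = accept_utf8("accentu\xC3\xA9"); // "accentué" encoded as UTF-8
$invalid = accept_utf8("accentu\xE9");     // the same word encoded as ISO-8859-1

var_dump($valid !== null, $invalid !== null);
```

As far as I know, mb_check_encoding() also rejects the obsolete 5- and 6-byte forms that seems_utf8() lets through, so the two checks will not always agree.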

© 2021 The Archive Base. All Rights Reserved