The Archive Base

Editorial Team
Asked: June 6, 2026

I have a string and start and length with which to extract a substring.

I have a string and start and length with which to extract a substring. Both positions (start and length) are based on the byte offsets in the original UTF8 string.

However, there is a problem:

The start and length are in bytes, so I cannot use “substring”. The UTF8 string contains several multi-byte characters. Is there a hyper-efficient way of doing this? (I don’t need to decode the bytes…)

Example:
var orig = '你好吗?';

The start and length might be 3,3 to extract the second character (好). I'm looking for something like:

var result = orig.substringBytes(3,3);

Help!

Update #1 In C/C++ I would just cast it to a byte array, but not sure if there is an equivalent in javascript. BTW, yes we could parse it into a byte array and parse it back to a string, but it seems that there should be a quick way to cut it at the right place. Imagine that ‘orig’ is 1000000 characters, and s = 6 bytes and l = 3 bytes.
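
For reference, the byte-array round trip mentioned above could look roughly like the sketch below. It assumes an environment that provides the standard TextEncoder and TextDecoder globals (current browsers and Node.js do); substrBytesViaBuffer is a hypothetical name used only for illustration. It encodes the whole string, so it is not the "cut at the right place" shortcut asked for, but it serves as a baseline to compare against.

```javascript
// Sketch: slice by UTF-8 byte offsets via a byte array round trip.
// substrBytesViaBuffer is a hypothetical helper, not part of any library.
function substrBytesViaBuffer(str, startInBytes, lengthInBytes) {
    // Encode the whole string to UTF-8 (linear time and memory).
    var bytes = new TextEncoder().encode(str);
    // subarray() returns a view on the same buffer, so nothing is copied here.
    var slice = bytes.subarray(startInBytes, startInBytes + lengthInBytes);
    // Decode only the selected bytes. If the offsets split a multi-byte
    // sequence, the decoder substitutes U+FFFD replacement characters.
    return new TextDecoder('utf-8').decode(slice);
}

var orig = '你好吗?';
console.log(substrBytesViaBuffer(orig, 3, 3)); // 好
```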

Update #2 Thanks to zerkms's helpful redirection, I ended up with the following, which does NOT work correctly: it handles multi-byte characters but gets single-byte characters wrong.

function substrBytes(str, start, length)
{
    var ch, startIx = 0, endIx = 0, re = '';
    for (var i = 0; 0 < str.length; i++)
    {
        startIx = endIx++;

        ch = str.charCodeAt(i);
        do {
            ch = ch >> 8;   // a better way may exist to measure ch len
            endIx++;
        }
        while (ch);

        if (endIx > start + length)
        {
            return re;
        }
        else if (startIx >= start)
        {
            re += str[i];
        }
    }
}

Update #3 I don’t think shifting the char code really works. I’m reading two bytes when the correct answer is three… somehow I always forget this. The codepoint is the same for UTF8 and UTF16, but the number of bytes taken up on encoding depends on the encoding!!! So this is not the right way to do this.

1 Answer

  1. Editorial Team
    Added an answer on June 6, 2026 at 9:21 am

    I had a fun time fiddling with this. Hope this helps.

    Because Javascript does not allow direct byte access on a string, the only way to find the start position is a forward scan.


    Regarding your Update #3 ("I don't think shifting the char code really works. I'm reading two bytes when the correct answer is three… somehow I always forget this. The codepoint is the same for UTF8 and UTF16, but the number of bytes taken up on encoding depends on the encoding!!! So this is not the right way to do this."):

    This is not quite correct: there is no UTF-8 string in JavaScript. According to the ECMAScript (ECMA-262) specification, every string, regardless of the input encoding, is a sequence of 16-bit unsigned integers, which are normally interpreted as UTF-16 code units.

    Considering this, the 8-bit shift is correct (but unnecessary).

    What is wrong is the assumption that your character is stored as a 3-byte sequence.
    In fact, every element of a JS (ECMA-262) string is 16 bits (2 bytes) long. A character such as 你 occupies one such element; only its UTF-8 encoding takes 3 bytes.
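
To make the difference concrete, here is a small sketch. It assumes TextEncoder is available as a global (as in current browsers and Node.js), and bytesByShifting is a hypothetical name that mirrors the shifting loop from the question. It shows that the shift counts the bytes of the 16-bit unit, not the bytes of the UTF-8 encoding.

```javascript
var ch = '你';

// The string holds one 16-bit unit: length counts units, not UTF-8 bytes.
console.log(ch.length);                           // 1
console.log(ch.charCodeAt(0).toString(16));       // 4f60

// Hypothetical helper mirroring the question's loop: shifting a 16-bit
// value right by 8 bits can never count more than 2 bytes.
function bytesByShifting(code) {
    var count = 0;
    do {
        code = code >> 8;
        count++;
    } while (code);
    return count;
}
console.log(bytesByShifting(ch.charCodeAt(0)));   // 2

// The UTF-8 encoding of the same character takes 3 bytes.
console.log(new TextEncoder().encode(ch).length); // 3
```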

    This can be worked around by converting the multibyte characters to utf-8 manually, as shown in the code below.


    UPDATE This solution doesn’t handle codepoints >= U+10000 including emoji. See APerson’s Answer for a more complete solution.
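
One way to cover code points at or above U+10000 might be to iterate by code point instead of by 16-bit unit. The following is only a sketch under that assumption and is not the solution referenced above; utf8Length and substrUtf8BytesByCodePoint are hypothetical names. It relies on for...of and codePointAt from ES2015, and it makes one design choice that differs from the code below: a character that only partly fits into the requested byte range is left out.

```javascript
// Number of bytes a Unicode code point occupies in UTF-8.
function utf8Length(codePoint) {
    if (codePoint < 0x80) return 1;
    if (codePoint < 0x800) return 2;
    if (codePoint < 0x10000) return 3;
    return 4;
}

// Hypothetical helper: substring by UTF-8 byte offset and byte length.
function substrUtf8BytesByCodePoint(str, startInBytes, lengthInBytes) {
    var endInBytes = startInBytes + lengthInBytes;
    var bytePos = 0;
    var result = '';
    // for...of walks the string by code point, so a surrogate pair arrives
    // as one two-unit string instead of two separate halves.
    for (var chr of str) {
        var next = bytePos + utf8Length(chr.codePointAt(0));
        if (next > endInBytes) break;               // would exceed the range
        if (bytePos >= startInBytes) result += chr; // starts inside the range
        bytePos = next;
    }
    return result;
}

var sample = 'a\u{1F600}你'; // bytes: a = 0, emoji = 1..4, 你 = 5..7
console.log(substrUtf8BytesByCodePoint(sample, 1, 4) === '\u{1F600}'); // true
console.log(substrUtf8BytesByCodePoint(sample, 5, 3));                 // 你
```

The scan is still linear in the number of characters before the end offset, but it stops as soon as the requested range is complete and never encodes anything.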


    See the details explained in my example code:

    function encode_utf8( s )
    {
      return unescape( encodeURIComponent( s ) );
    }
    
    function substr_utf8_bytes(str, startInBytes, lengthInBytes) {
    
       /* this function scans a multibyte string and returns a substring. 
        * arguments are start position and length, both defined in bytes.
        * 
        * this is tricky, because javascript only allows character level 
        * and not byte level access on strings. Also, all strings are stored
        * in utf-16 internally - so we need to convert characters to utf-8
        * to detect their length in utf-8 encoding.
        *
        * the startInBytes and lengthInBytes parameters are based on byte 
        * positions in a utf-8 encoded string.
        * in utf-8, for example: 
        *       "a" is 1 byte, 
        *       "ü" is 2 bytes, 
        *  and  "你" is 3 bytes.
        *
        * NOTE:
        * according to ECMAScript 262 all strings are stored as a sequence
        * of 16-bit characters. so we need an encode_utf8() function to safely
        * detect the length our character would have in a utf8 representation.
        * 
        * http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf
        * see "4.3.16 String Value":
        * > Although each value usually represents a single 16-bit unit of 
        * > UTF-16 text, the language does not place any restrictions or 
        * > requirements on the values except that they be 16-bit unsigned 
        * > integers.
        */
    
        var resultStr = '';
        var startInChars = 0;
        // declare the loop variables so they do not leak as implicit globals
        var bytePos, ch, end, n;
    
        // scan string forward to find index of first character
        // (convert start position in bytes to start position in characters)
    
        for (bytePos = 0; bytePos < startInBytes && startInChars < str.length; startInChars++) {
    
            // get numeric code of character (is >= 128 for multibyte character)
            // and increase "bytePos" for each byte of the character sequence
    
            ch = str.charCodeAt(startInChars);
            bytePos += (ch < 128) ? 1 : encode_utf8(str[startInChars]).length;
        }
    
        // now that we have the position of the starting character,
        // we can build the resulting substring
    
        // as we don't know the end position in chars yet, we start with a mix of
        // chars and bytes. we decrease "end" by the byte count of each selected 
        // character, so the loop stops once lengthInBytes bytes are consumed
        end = startInChars + lengthInBytes - 1;
    
        // the "n < str.length" check stops the scan at the end of the string
        for (n = startInChars; startInChars <= end && n < str.length; n++) {
            // get numeric code of character (is >= 128 for multibyte character)
            // and decrease "end" for each byte of the character sequence
            ch = str.charCodeAt(n);
            end -= (ch < 128) ? 1 : encode_utf8(str[n]).length;
    
            resultStr += str[n];
        }
    
        return resultStr;
    }
    
    var orig = 'abc你好吗?';
    
    alert('res: ' + substr_utf8_bytes(orig, 0, 2)); // alerts: "ab"
    alert('res: ' + substr_utf8_bytes(orig, 2, 1)); // alerts: "c"
    alert('res: ' + substr_utf8_bytes(orig, 3, 3)); // alerts: "你"
    alert('res: ' + substr_utf8_bytes(orig, 6, 6)); // alerts: "好吗"
    
© 2021 The Archive Base. All Rights Reserved