The Archive Base

Editorial Team
Asked: June 14, 2026


Let’s say I have a string in Python:

>>> s = 'python'
>>> len(s)
6

Now I encode this string like this:

>>> b = s.encode('utf-8')
>>> b16 = s.encode('utf-16')
>>> b32 = s.encode('utf-32')

What I get from the above operations is a bytes array; that is, b, b16 and b32 are just arrays of bytes (each byte being 8 bits long, of course).

But we encoded the string. So, what does this mean? How do we attach the notion of “encoding” to the raw array of bytes?

The answer lies in the fact that each of these arrays of bytes is generated in a particular way. Let’s look at these arrays:

>>> [hex(x) for x in b]
['0x70', '0x79', '0x74', '0x68', '0x6f', '0x6e']

>>> len(b)
6

This array indicates that for each character we have one byte (because all the characters fall below 127). Hence, we can say that “encoding” the string to ‘utf-8’ collects each character’s corresponding code point and puts it into the array. If the code point cannot fit in one byte, then utf-8 consumes two bytes. Hence utf-8 consumes the least number of bytes possible.

>>> [hex(x) for x in b16]
['0xff', '0xfe', '0x70', '0x0', '0x79', '0x0', '0x74', '0x0', '0x68', '0x0', '0x6f', '0x0', '0x6e',  '0x0']

>>> len(b16)
14     # (2 + 6*2)

Here we can see that “encoding to utf-16” first puts a two-byte BOM (FF FE) into the bytes array, and after that, for each character, it puts two bytes into the array. (In our case, the second byte is always zero.)

>>> [hex(x) for x in b32]
['0xff', '0xfe', '0x0', '0x0', '0x70', '0x0', '0x0', '0x0', '0x79', '0x0', '0x0', '0x0', '0x74', '0x0', '0x0', '0x0', '0x68', '0x0', '0x0', '0x0', '0x6f', '0x0', '0x0', '0x0', '0x6e', '0x0', '0x0', '0x0']

>>> len(b32)
28     # (2+ 6*4 + 2)

In the case of “encoding in utf-32”, we first put the BOM, then for each character we put four bytes, and lastly we put two zero bytes into the array.

Hence, we can say that the “encoding process” collects 1, 2, or 4 bytes (depending on the encoding name) for each character in the string, and prepends and appends more bytes to them to create the final array of bytes.

Now, my questions:

  • Is my understanding of the encoding process correct or am I missing something?
  • We can see that the memory representation of the variables b, b16 and b32 is actually a list of bytes. What is the memory representation of the string? Exactly what is stored in memory for a string?
  • We know that when we do an encode(), each character’s corresponding code point is collected (the code point corresponding to the encoding name) and put into an array of bytes. What exactly happens when we do a decode()?
  • We can see that in utf-16 and utf-32, a BOM is prepended, but why are two zero bytes appended in the utf-32 encoding?
1 Answer

  1. Editorial Team
    Added an answer on June 14, 2026 at 2:54 pm

    First of all, UTF-32 is a 4-byte encoding, so its BOM is a four-byte sequence too:

    >>> import codecs
    >>> codecs.BOM_UTF32
    b'\xff\xfe\x00\x00'
    

    And because different computer architectures order the bytes of a multi-byte value differently (called endianness), there are two variants of the BOM, little and big endian:

    >>> codecs.BOM_UTF32_LE
    b'\xff\xfe\x00\x00'
    >>> codecs.BOM_UTF32_BE
    b'\x00\x00\xfe\xff'
    

    The purpose of the BOM is to communicate that order to the decoder; read the BOM and you know whether the data is big or little endian. So those last two null bytes in your UTF-32 string are not padding: they are part of the last encoded character. The count is 4 (BOM) + 6*4 = 28, not 2 + 6*4 + 2.

    The UTF-16 BOM is similar, in that there are two variants:
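The byte count can be checked directly. The following is a minimal sketch, assuming Python 3, where the plain `utf-32` codec prepends a BOM in the machine's native byte order; the variable names are illustrative only.

```python
import codecs

s = "python"
b32 = s.encode("utf-32")  # native byte order, BOM prepended

# The BOM occupies the first 4 bytes, not 2.
bom = b32[:4]
assert bom in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)

# Everything after the BOM splits cleanly into 4-byte units, one per character.
body = b32[4:]
units = [body[i:i + 4] for i in range(0, len(body), 4)]
print(len(b32), len(units))  # 28 6, i.e. 4 + 6*4

# Naming the byte order explicitly leaves the BOM out.
no_bom = s.encode("utf-32-le")
print(len(no_bom))  # 24
```

Nothing is appended after the last character; the trailing zero bytes belong to the 4-byte unit for `n`.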

    >>> codecs.BOM_UTF16
    b'\xff\xfe'
    >>> codecs.BOM_UTF16_LE
    b'\xff\xfe'
    >>> codecs.BOM_UTF16_BE
    b'\xfe\xff'
    

    It depends on your computer architecture which one is used by default.
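As a small illustration of how the byte order and the BOM interact, here is a sketch assuming Python 3. It encodes with an explicit byte order (which writes no BOM) and then lets the generic `utf-16` decoder pick the order from a BOM; the names `le`, `be` and `native_bom` are made up for the example.

```python
import codecs
import sys

s = "python"

le = s.encode("utf-16-le")  # explicit byte order, no BOM
be = s.encode("utf-16-be")

print(le)  # b'p\x00y\x00t\x00h\x00o\x00n\x00'
print(be)  # b'\x00p\x00y\x00t\x00h\x00o\x00n'

# When decoding, the generic codec reads the BOM to choose the byte order,
# then turns each 2-byte unit back into a code point.
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")

# The default BOM follows the native order reported by sys.byteorder.
native_bom = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
```

Both decodes give back the original string, which is the point of the BOM: the same text can be stored in either order and still be read correctly.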

    UTF-8 doesn’t need a BOM at all; UTF-8 uses 1 to 4 bytes per character (adding bytes as needed to encode higher code points), but the order of those bytes is defined in the standard. Microsoft deemed it necessary to introduce a UTF-8 BOM anyway (so its Notepad application could detect UTF-8), but since the byte order never varies, its use is discouraged.
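To make the variable width concrete, here is a short sketch assuming Python 3. The sample characters are arbitrary picks from different code point ranges, and `utf-8-sig` is the standard-library codec that writes the optional signature.

```python
import codecs

# a, e-acute, the euro sign, and an emoji: one sample per UTF-8 length
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]

# "utf-8-sig" writes the optional signature; plain "utf-8" never does.
with_sig = "python".encode("utf-8-sig")
print(with_sig)  # b'\xef\xbb\xbfpython'

# Decoding with "utf-8-sig" drops the signature if it is present.
text = with_sig.decode("utf-8-sig")
```

Decoding the same bytes with plain `utf-8` keeps the signature as a leading U+FEFF character, which is one practical reason to avoid writing it.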

    As for what Python stores for unicode strings, that changed in Python 3.3. Before 3.3, internally at the C level, Python stored either UTF-16 or UTF-32 code units, depending on whether or not Python was compiled with wide character support (see How to find out if Python is compiled with UCS-2 or UCS-4?; UCS-2 is essentially UTF-16 and UCS-4 is UTF-32). So each character took either 2 or 4 bytes of memory.

    As of Python 3.3, the internal representation uses the minimal number of bytes per character required to represent all characters in the string. For plain ASCII and Latin-1-encodable text, 1 byte per character is used; for the rest of the BMP, 2 bytes are used; and for text containing characters beyond the BMP, 4 bytes are used. Python switches between the formats as needed. Thus, storage has become a lot more efficient for most cases. For more detail see What’s New in Python 3.3.
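The per-character width can be observed indirectly. This is a rough sketch that assumes CPython 3.3 or later, since `sys.getsizeof` reports an implementation detail and other interpreters may behave differently; `bytes_per_char` is a hypothetical helper written for this example, not a standard function.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage from the size difference of two strings."""
    # Both strings use the same internal width, so the fixed header cancels out.
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

samples = {
    "ascii": "a",
    "latin-1": "\u00e9",     # e-acute
    "bmp": "\u20ac",         # euro sign
    "astral": "\U0001F600",  # an emoji outside the BMP
}
widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
print(widths)
```

On CPython 3.3+ this prints `{'ascii': 1, 'latin-1': 1, 'bmp': 2, 'astral': 4}`, matching the three storage widths described above.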

    I can strongly recommend you read up on Unicode and Python with:

    • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
    • The Python Unicode HOWTO