The Archive Base

Editorial Team
Asked: May 26, 2026

I have the following string that I would like to Huffman-encode and store efficiently

I have the following string that I would like to Huffman-encode and store efficiently into a bit array:

>>> print sequence
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|

The frequencies of the symbols in sequence are:

>>> print freqTuples
[(0.40540540540540543, 'A'), (0.1891891891891892, 'T'), (0.16216216216216217, 'C'), (0.16216216216216217, 'G'), (0.05405405405405406, 'N'), (0.02702702702702703, '|')]

I translate this into a Huffman code dictionary:

>>> print codeDict
{'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
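For reference, one way such a dictionary could be derived is sketched below in Python 3, using only the standard library. The helper name `build_huffman_code` is hypothetical and this is not necessarily how the dictionary above was produced. Tie-breaking between equal frequencies is arbitrary, so the individual bit patterns may differ from those shown above, although the total encoded length should be the same. The sketch assumes a non-empty input.

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_code(text):
    """Return a dict mapping each symbol in text to its Huffman bit string."""
    counts = Counter(text)
    tiebreak = count()  # unique counter so heap entries never compare dicts
    # Each heap entry is (weight, tiebreaker, {symbol: code_so_far}).
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in counts.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]

sequence = "GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|"
code_dict = build_huffman_code(sequence)
total_bits = sum(len(code_dict[ch]) for ch in sequence)
print(code_dict)
print(total_bits)
```

For this sequence the total should come to 84 bits, which matches the length of the bit array shown further down.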

I then used the Python bitstring package to translate the string, character by character, into an instance of the BitArray class. I call this instance bitArray; it contains the bits for each character, encoded with its respective Huffman code:

>>> print bitArray.bin
0b001000010100100110101100111100110101101100000100101100000001101010100000010000010111

Here is the bit array in bytes:

>>> print bitArray.tobytes()
!I\254\363[^D\260^Z\240Ap

I must use tobytes() instead of bytes, as the bit array I generate does not divide evenly into 8-bit segments.
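To make the padding concrete, here is a small Python 3 sketch that does the same packing without the bitstring package. The helper `pack_bits` is hypothetical; it zero-pads the final partial byte, which appears to be what tobytes() does here, judging by the trailing `p` (0x70) in the output above.

```python
code_dict = {'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
sequence = "GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|"

def pack_bits(bits):
    """Pack a string of '0'/'1' characters into bytes, zero-padding the tail."""
    padding = -len(bits) % 8  # bits needed to reach a whole number of bytes
    padded = bits + "0" * padding
    packed = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return packed, padding

bits = "".join(code_dict[ch] for ch in sequence)
packed, padding = pack_bits(bits)
print(len(bits), padding, len(packed))  # 84 4 11
```

The 84 encoded bits need 4 padding bits to fill 11 whole bytes. A decoder has to know how many padding bits to ignore, which is one job the `|` terminator can do.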

When I calculate the storage efficiency of the BitArray representation (the ratio of the sizes of the bit array and the input string), I get worse performance than if I had left the input string unencoded:

>>> sys.getsizeof(bitArray.tobytes()) / float(len(sequence))
1.2972972973

Am I measuring storage efficiency correctly? (If I encode longer input strings, this ratio improves, but it seems to approach an asymptotic limit of around 0.28. I’d like to confirm if this is the right way to measure things.)

Edit

The following two approaches yield different answers:

>>> print len(bitArray.tobytes()) / float(len(mergedSequence))
0.297297297297

>>> print bitArray.len / (8.*len(mergedSequence))
0.283783783784

I’m not sure which to believe. But in the process of writing data to storage, I think I would need the byte representation, which makes me inclined towards choosing the first result.
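The two figures differ only by the padding in the last byte. A short Python 3 sketch of the arithmetic, assuming one byte per input character and the 84-bit length implied by the second result:

```python
import math

n_symbols = 37  # length of the input string, one byte per character assumed
n_bits = 84     # encoded length in bits, as implied by the second ratio above

exact_ratio = n_bits / (8 * n_symbols)  # counts only the bits actually used
stored_bytes = math.ceil(n_bits / 8)    # whole bytes needed on disk
stored_ratio = stored_bytes / n_symbols

print(stored_bytes, round(exact_ratio, 4), round(stored_ratio, 4))  # 11 0.2838 0.2973
```

The padding is at most 7 bits regardless of input size, so the two ratios converge as the input grows.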

1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 4:52 pm
    >>> sys.getsizeof(bitArray.tobytes()) / float(len(sequence))
    1.2972972973

    This implies that the encoded version is about 30% larger than the original sequence.

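    I don’t think you want to use getsizeof here. It reports the in-memory size of the Python object, which includes the object's header overhead and not just the payload bytes. If you want to minimize the size of the Python object, you should be using getsizeof(sequence) as well, rather than len.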
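A quick Python 3 sketch of that overhead follows. It assumes CPython, where a bytes object carries a fixed-size header; the exact number is an implementation detail and differs between versions and platforms.

```python
import sys

payload = bytes(11)   # 11 zero bytes, the same length as the padded Huffman output
bigger = bytes(1100)  # a payload 100 times larger

# The gap between getsizeof and len is per-object overhead, not data.
overhead_small = sys.getsizeof(payload) - len(payload)
overhead_big = sys.getsizeof(bigger) - len(bigger)
print(overhead_small, overhead_big)
```

On CPython the two overheads should be equal, which is why the getsizeof-based ratio looks so poor for an 11-byte payload and improves as the input grows.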

    If instead, you want to do what Huffman coding is meant to do, and minimize the binary representation, then you want to use len on both (assuming the sequence is represented as one-byte-per-character).

    So, your real ratio is 11 / 37, or about 0.297.

    I assume you’re using Huffman coding as an exercise, as this doesn’t seem like a logical way to efficiently store what is essentially a four-letter code with a termination character. It would at least be better to use arithmetic coding, which lets you use what is effectively base-5 encoding instead of base-2; that is optimal for 5 equally likely characters.
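To put rough numbers on these options for the short example string, the Python 3 sketch below compares the empirical Shannon entropy (the floor for any code that treats symbols independently with these frequencies), the Huffman cost from the question (84 bits over 37 symbols), and log2(5), the per-symbol cost of a uniform base-5 code. The variable names are illustrative only.

```python
import math
from collections import Counter

sequence = "GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|"
n = len(sequence)
counts = Counter(sequence)

# Empirical entropy in bits per symbol.
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
huffman_bits_per_symbol = 84 / n      # 84 bits is the encoded length in the question
uniform_base5_bits = math.log2(5)     # five equally likely symbols

print(round(entropy, 2), round(huffman_bits_per_symbol, 2), round(uniform_base5_bits, 2))
```

For this particular skewed sample the Huffman code sits between the entropy and the uniform base-5 figure. An arithmetic coder only gets below the Huffman cost if it models the actual symbol frequencies rather than assuming a uniform distribution.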

    Really, I would assume that in a sequence long enough to be worth compressing, there is a known ratio of G:A:C:T and/or a fixed-length 2-bit encoding will be just as efficient (as the ratios approach 1:1:1:1), since you don’t really need to encode the termination character.
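A minimal Python 3 sketch of such a fixed-length 2-bit encoding is below. The bit assignment and the helper names `pack_2bit` and `unpack_2bit` are arbitrary choices for illustration. It assumes the input contains only A, C, G and T (so N would need separate handling) and that the sequence length is stored separately in place of a terminator. The `dna` string is the question's sequence with the N and | characters dropped.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"

def pack_2bit(bases):
    """Pack A/C/G/T into 2 bits each, four bases per byte."""
    out = bytearray()
    for i in range(0, len(bases), 4):
        chunk = bases[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)

def unpack_2bit(packed, n_bases):
    """Recover n_bases symbols; the length replaces a terminator character."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])

dna = "GTCAGGACAAGAAAGACAATCCAATTACATTATG"
packed = pack_2bit(dna)
print(len(packed) / len(dna))
print(unpack_2bit(packed, len(dna)) == dna)
```

For long inputs the ratio tends to 0.25, slightly below the roughly 0.28 limit mentioned in the question.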
