I have recently started parsing binary data with Python, but I am confused by the way Python treats “byte” items. Take, for example, the following interpreter session:
>>> f = open('somefile.gz', 'rb')
>>> f
<open file 'somefile.gz', mode 'rb' at 0xb77f4d88>
>>> bytes = f.read()
>>> bytes[0]
'\x1f'
>>> len(bytes[0])
1
>>> int(bytes[0]) <---- calling __str__ automatically on bytes[0] ?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\x1f'
The above session shows that bytes[0] has the size of 1 byte but the __str__ representation is a hexadecimal one. No worries, but when I try to treat bytes[0] as a single byte, I get funky behaviour.
If I want to parse/interpret a binary stream based on some specification, where the specification includes representations in hexadecimal, binary and decimal, how would I go about doing that?
An example would be “first two bytes are \xbeef, the next is a decimal 8, followed by a packed bit field where each of the 8 bits of the byte represents some flag”. I guess there are a few modules out there which make this task easy, but I’d want to do it from scratch.
I have seen references to the struct module, but is there no way of checking the bytes read directly without introducing a new module? Something like bytes[0] == 0xbeef ?
Can someone please help me out with how folks normally parse binary data conforming to a specification using Python? Thanks.
You’re using Python 2.x. Prior to Python 3.0, reading a file, even a binary file, returns a string. What you’re calling a “bytes” object is really a string. Indexing into a string as you do with “bytes[0]” just returns a 1-character string.
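That is also why int() fails: it parses decimal text, whereas ord() returns the numeric value of a single character. A minimal sketch, using a made-up four-byte sample instead of a real file. The name `data` is my own choice (it avoids shadowing the built-in `bytes`), and the one-byte slice `data[0:1]` is used so the snippet behaves the same on Python 2 and Python 3:

```python
# Hypothetical sample standing in for f.read(); 0x1f 0x8b is the gzip magic number.
data = b'\x1f\x8b\x08\x00'

# ord() gives the integer value of a single character (here a 1-byte slice).
first = ord(data[0:1])
print(first)        # 31
print(hex(first))   # 0x1f

# A bytearray exposes every byte as an integer in the range 0-255.
values = list(bytearray(data))
print(values)       # [31, 139, 8, 0]
```

On Python 2 alone, `ord(bytes[0])` would do the same job, since indexing already yields a 1-character string there.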
The struct module would probably be best suited to what you want, but you can do what you ask without it if you really want to:
“Something like bytes[0] == 0xbeef ?”
This won’t work because 0xbeef is a two-byte sequence, but bytes[0] is only a single byte. You can do this instead:
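Here is a rough sketch of one way to check the bytes without struct, using a made-up four-byte record that follows the layout from the question (two magic bytes, a decimal 8, then a flags byte). The sample data and the names `magic_ok`, `raw`, `version` and `flags` are assumptions for illustration only:

```python
# Hypothetical record: magic 0xbeef, the value 8, then a flags byte (0b10000101).
data = b'\xbe\xef\x08\x85'

# Compare a two-byte slice against a two-byte literal.
magic_ok = data[0:2] == b'\xbe\xef'

# Alternatively, combine the two bytes into one big-endian integer.
raw = bytearray(data)               # indexing a bytearray yields integers
magic = (raw[0] << 8) | raw[1]

# The third byte is a plain number.
version = raw[2]

# Test each bit of the flags byte; flags[0] is the least significant bit.
flags = [bool(raw[3] & (1 << bit)) for bit in range(8)]

print(magic_ok)         # True
print(magic == 0xbeef)  # True
print(version)          # 8
print(flags)            # [True, False, True, False, False, False, False, True]
```

The slice comparison works on both Python 2 and 3 because a slice has the same type as the literal it is compared with. Whether bit 0 or bit 7 counts as the "first" flag depends on your specification, so treat the ordering here as an assumption.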
In Python 3.x, things work a little bit more like you’d expect. Reading a binary file returns a bytes object that behaves like a sequence of 1-byte unsigned integers, not as a string.
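For illustration, a small Python 3 sketch of that behaviour, again on a made-up four-byte record rather than a real file. It also shows the struct module mentioned earlier for comparison; the big-endian byte order is an assumption about the specification, not something stated in the question:

```python
import struct

# Hypothetical record: magic 0xbeef, the value 8, then a flags byte.
data = b'\xbe\xef\x08\x85'

# Indexing a bytes object yields an int in Python 3.
print(data[0])           # 190
print(data[0] == 0xbe)   # True

# int.from_bytes (Python 3.2+) turns several bytes into one integer.
magic = int.from_bytes(data[0:2], 'big')
print(magic == 0xbeef)   # True

# The same record with struct: '>' is big-endian, 'H' an unsigned
# 16-bit integer, 'B' an unsigned 8-bit integer.
magic2, version, flags = struct.unpack('>HBB', data)
print(magic2 == 0xbeef, version, bin(flags))
```

With struct the whole layout lives in one format string, which tends to be easier to keep in sync with a specification than hand-written shifts.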