I have a Python script that read a list of path names from a file and open them using the gzip module. It works well under Linux. But when I used it under Windows, I met an error when calling the gzip.open function. The error message is as follows:
File "C:\dev_tools\Python27\lib\gzip.py", line 34, in open
return GzipFile(filename, mode, compresslevel)
File "C:\dev_tools\Python27\lib\gzip.py", line 89, in __init__
fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
The filename should be something like
‘G:\ext_pt1\cfx33_50instr4_testset\cfx33_50instr4_0-99\cfx33_50instr4_cov\cfx33_50instr4_id0_cov\cfx33_50instr4_id0.detail.rpt.gz’
But when I printed the filename, it printed out something like
‘ ■G : \ e x t _ p t 1 \ c f x 3 3 _ 5 0 i n s t r 4 _ t e s t s e t \
c f x 3 3 _ 5 0 i n s t r 4 _ 0 – 9 9 \ c f x 3 3 _ 5 0 i n s t r 4 _
c o v \ c f x 3 3 _ 5 0 i n s t r 4 _ i d 0 _ c o v \ c f x 3 3 _ 5 0
i n s t r 4 _ i d 0 . d e t a i l . r p t . g z’
And when I printed repr(filename), it printed out something like
‘\xff\xfeG\x00:\x00\\x00e\x00x\x00t\x00_\x00p\x00t\x001\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00s\x00t\x00r\x004\x00_\x00t\x00e\x00s\x00t\x00s\x00e\x00t\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00\x00t\x
00r\x004\x00_\x000\x00-\x009\x009\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00\x00t\x00r\x004\x00_\x00c\x00o\x00v\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00s\x00t\x00r\x004\x00_\x00i\x00d\x000\x00_\x00c\x00o\x00v\x00\\x00c\x00f\x00x\x003\x003\x00_\x005\x000\x00i\x00n\x00s\x00t\x00r\x004\x00_\x00i\x00d\x000\x00.\x00d\x00e\x00t\x00a\x00i\x00l\x00.\x00r\x00p\x00t\x00.\x00g\x00z\x00’
I don’t know why Python added those spaces (possibly the NULL bytes?) when it read the file. Does anyone have any clue?
Python has not added anything; it has merely read what is in the file. You have a little-endian UTF-16 string there, as you can plainly tell by the byte-order mark in the first two bytes. If you are not expecting this, you could convert it to ASCII (assuming it doesn’t have any non-ASCII characters).
Or just convert it to proper Unicode and use it that way, if Windows will tolerate it:
Above, I have manually specified the byte order and then stripped off the BOM, rather than specifying “utf-16” as the encoding and letting Python figure out the byte order. This is because the BOM is going to be found once at the beginning of the file, not at the beginning of each line, so if you are converting the lines to Unicode one at a time, you won’t have a BOM most of the time.
However, it might make more sense to go back to the source of that file and figure out why it’s being saved in little-endian UTF-16 if you expected ASCII. Is the file generated the same way on Linux and Windows, for instance? Has it been touched by a text editor that defaults to saving as Unicode? Etc.