For input text files, I know that .seek and .tell both operate with bytes, usually – that is, .seek seeks a certain number of bytes in relation to a point specified by its given arguments, and .tell returns the number of bytes since the beginning of the file.
My question is: does this work the same way when using other encodings like utf-8? I know utf-8, for example, requires several bytes for some characters.
It would seem that if those methods still deal with bytes when parsing utf-8 files, then unexpected behavior could result (for instance, the cursor could end up inside of a character’s multi-byte encoding, or a multi-byte character could register as several characters).
If so, are there other methods to do the same tasks? Especially for when parsing a file requires information about the cursor’s position in terms of characters.
On the other hand, if you specify the encoding in the open() function …
infile = open(filename, encoding='utf-8')
Does the behavior of .seek and .tell change?
Assuming you're using io.open() (in Python 2 this is not the same as the builtin open(); in Python 3 the two are the same function), text mode gets you an instance of io.TextIOWrapper, so the io documentation linked below should answer your question.
NOTE: You should also be aware that this still doesn't guarantee that seek() and tell() work in whole characters: text mode deals in Unicode code points, and a single character can be composed of more than one code point (for example, ą can be written as u'\u0105' or u'a\u0328', and both will print the same character).
Source: http://docs.python.org/library/io.html#id1
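To make the difference concrete, here is a minimal sketch, assuming Python 3 (where the builtin open() is io.open()). The helper names demo_positions and code_point_counts are made up for this illustration. It shows a binary-mode seek landing inside a multi-byte character, a text-mode tell()/seek() round trip where the value from tell() is treated as an opaque cookie rather than a byte or character count, and the two spellings of ą from the note above.

```python
import os
import tempfile
import unicodedata


def demo_positions():
    """Compare byte offsets with text-mode positions on a small UTF-8 file."""
    text = "\u0105\nbc\n"  # U+0105 (a with ogonek) takes two bytes in UTF-8
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text)

        # Binary mode: offsets are plain byte counts, so seek(1) lands
        # in the middle of the two-byte encoding of U+0105.
        with open(path, "rb") as f:
            f.seek(1)
            tail = f.read()
        try:
            tail.decode("utf-8")
            decodes_cleanly = True
        except UnicodeDecodeError:
            decodes_cleanly = False

        # Text mode: keep the value from tell() as an opaque cookie and
        # only hand it back to seek(); do not do arithmetic on it.
        with open(path, encoding="utf-8", newline="") as f:
            first = f.readline()
            cookie = f.tell()
            second = f.readline()
            f.seek(cookie)
            again = f.readline()
    finally:
        os.remove(path)
    return decodes_cleanly, first, second, again


def code_point_counts(s):
    """Return the length of s as written and after NFC normalization."""
    return len(s), len(unicodedata.normalize("NFC", s))


if __name__ == "__main__":
    print(demo_positions())
    print(code_point_counts("a\u0328"))  # decomposed spelling
    print(code_point_counts("\u0105"))   # precomposed spelling
```

If the parser needs a position counted in characters, one option is to keep a running total of len() over the strings returned by read() or readline(), and to normalize the text first (for example to NFC) if composed and decomposed spellings should count the same.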