For input text files, I know that .seek and .tell both operate with bytes,

Question


Asked: June 6, 2026 (2026-06-06T14:34:34+00:00)

For input text files, I know that .seek and .tell usually operate in bytes: .seek moves a certain number of bytes relative to a point specified by its arguments, and .tell returns the number of bytes since the beginning of the file.

My question is: does this work the same way when using other encodings like utf-8? I know utf-8, for example, requires several bytes for some characters.

It would seem that if those methods still deal with bytes when parsing utf-8 files, then unexpected behavior could result (for instance, the cursor could end up inside of a character’s multi-byte encoding, or a multi-byte character could register as several characters).

If so, are there other methods to do the same tasks? Especially for when parsing a file requires information about the cursor’s position in terms of characters.

On the other hand, if you specify the encoding in the open() function …

infile = open(filename, encoding='utf-8')

Does the behavior of .seek and .tell change?


1 Answer

Editorial Team · Answer 1 · 2026-06-06T14:34:35+00:00

Assuming you’re using io.open() (in Python 2 this is not the same as the builtin open(); in Python 3 the two are the same function), text mode gives you an io.TextIOWrapper instance, so this passage from the io documentation should answer your question:

Text I/O over a binary storage (such as a file) is significantly
slower than binary I/O over the same storage, because it implies
conversions from unicode to binary data using a character codec. This
can become noticeable if you handle huge amounts of text data (for
example very large log files). Also, TextIOWrapper.tell() and
TextIOWrapper.seek() are both quite slow due to the reconstruction
algorithm used.
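
To make this concrete, here is a minimal sketch, assuming Python 3 and a throwaway temporary file (the path and sample text are made up for illustration). It shows that read(n) in text mode counts code points rather than bytes, and that the value returned by tell() is best treated as an opaque bookmark that you only hand back to seek().

```python
import os
import tempfile

# Hypothetical sample text: "é" and "ö" each take two bytes in UTF-8.
sample = "h\u00e9llo w\u00f6rld\n"

# Write the sample to a temporary file; newline="" disables newline translation.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(sample)

with open(path, encoding="utf-8", newline="") as f:
    first = f.read(2)      # two code points ("hé"), although that is three bytes
    bookmark = f.tell()    # opaque position; do not do arithmetic on it
    rest = f.read()
    f.seek(bookmark)       # safe, because the value came from tell()
    rest_again = f.read()

byte_size = os.path.getsize(path)   # size on disk in bytes
char_count = len(sample)            # length in code points
```

In text mode, seek() is only documented to work with zero or a value previously returned by tell(), so if you need positions measured in characters it is usually simpler to read the text and count with len() yourself.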

NOTE: You should also be aware that even in text mode the unit is the Unicode code point, not the character as a reader sees it, so there is still no guarantee that positions line up with whole characters. A single character can be composed of more than one code point: for example, ą can be written as u'\u0105' or u'a\u0328', and both will print the same character.
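
The two spellings of ą can be compared directly. This is a small sketch, assuming Python 3 string literals; normalizing with unicodedata.normalize is one possible way to get a consistent code point count, not something the original answer prescribes.

```python
import unicodedata

precomposed = "\u0105"    # ą as a single code point
decomposed = "a\u0328"    # "a" followed by COMBINING OGONEK

# The two strings look identical when printed but differ in length.
lengths = (len(precomposed), len(decomposed))

# They also compare unequal until they are normalized to the same form.
equal_raw = precomposed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Their UTF-8 encodings have different sizes as well.
utf8_sizes = (len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

Normalizing text to NFC before counting makes len() closer to the visible character count, although some characters have no precomposed form and will still span several code points.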

Source: http://docs.python.org/library/io.html#id1
