I have a file in UTF-8, where some lines contain the U+2028 Line Separator

Asked: May 11, 2026

I have a file in UTF-8, where some lines contain the U+2028 Line Separator character (http://www.fileformat.info/info/unicode/char/2028/index.htm). I don’t want it to be treated as a line break when I read lines from the file. Is there a way to exclude it from separators when I iterate over the file or use readlines()? (Besides reading the entire file into a string and then splitting by \n.) Thank you!

1 Answer

Editorial Team · Answer 1 · 2026-05-11T22:42:05+00:00

I can’t reproduce this behaviour in Python 2.5, 2.6, or 3.0 on Mac OS X; U+2028 is never treated as a line ending there. Could you give more detail about where you see this behaviour?

That said, here is a subclass of Python 2’s built-in “file” class that might do what you want:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 only: relies on the built-in "file" type and the print statement.

class MyFile(file):
    def __init__(self, *arg, **kwarg):
        file.__init__(self, *arg, **kwarg)
        self.EOF = False

    def next(self, catchEOF=False):
        if self.EOF:
            raise StopIteration("End of file")
        try:
            nextLine = file.next(self)
        except StopIteration:
            self.EOF = True
            if not catchEOF:
                raise
            # EOF reached while joining a continuation: end with what we have.
            return ""
        # If the raw line ends with U+2028, glue the following line onto it.
        if nextLine.decode("utf8").endswith(u'\u2028'):
            return nextLine + self.next(catchEOF=True)
        else:
            return nextLine

A = MyFile("someUnicode.txt")
for line in A:
    print line.strip("\n").decode("utf8")
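On Python 3 (or Python 2.6+ through io.open), a subclass may not be needed. The following is a minimal sketch, not a definitive fix: it assumes the file is opened in text mode with newline="\n", in which case only "\n" ends a line and U+2028 stays inside it. The helper name iter_lines and the sample text are made up for illustration. For contrast, the sketch also shows that str.splitlines() does treat U+2028 as a line boundary; if the file is being read through codecs.open, that may be where the extra breaks come from.

```python
import io
import os
import tempfile


def iter_lines(path):
    """Yield lines from a UTF-8 file, splitting only on LF."""
    # newline="\n" turns off universal-newline handling, so "\n" is the only
    # terminator; U+2028 is kept in the line as an ordinary character.
    with io.open(path, encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield line


if __name__ == "__main__":
    # Hypothetical sample data: the first line contains U+2028.
    sample = u"first\u2028still first\nsecond\n"
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with io.open(path, "w", encoding="utf-8", newline="\n") as handle:
            handle.write(sample)
        # Iterating the file yields two lines; U+2028 does not split the first.
        for line in iter_lines(path):
            print(repr(line))
        # str.splitlines() is different: it also breaks at U+2028.
        print(len(sample.splitlines()))
    finally:
        os.remove(path)
```

If the lines must come from an already decoded string, splitting on "\n" explicitly (as the question mentions) avoids the broader set of boundaries that splitlines() recognises.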
