I have a set of data, but I need to work only with utf-8

Question

Editorial Team

Asked: June 13, 2026 (2026-06-13T23:03:57+00:00)

I have a set of data files, but I need to work only with UTF-8 data, so I need to discard everything that contains non-UTF-8 characters.

When I try to work with these files, I receive:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3062: character maps to <undefined>

and

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 1576: invalid start byte

My code

import io
import os


class Corpus:
    def __init__(self, path_to_dir=None):
        self.path_to_dir = path_to_dir if path_to_dir else []

    def emails_as_string(self):
        for file_name in os.listdir(self.path_to_dir):
            if not file_name.startswith("!"):
                with io.open(self.add_slash(self.path_to_dir) + file_name, 'r', encoding='utf-8') as body:
                    yield [file_name, body.read()]

    def add_slash(self, path):
        if path.endswith("/"):
            return path
        return path + "/"

I receive the error here: yield [file_name, body.read()] and here: list_of_emails = mailsrch.findall(text). When the files are valid UTF-8, everything works fine.

1 Answer

Editorial Team · Answer 1 · 2026-06-13T23:03:58+00:00

I suspect you want to use the errors='ignore' argument on bytes.decode, which drops any bytes that cannot be decoded instead of raising UnicodeDecodeError. See http://docs.python.org/3/howto/unicode.html#unicode-howto and http://docs.python.org/3/library/stdtypes.html#bytes.decode for more info.
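To make the effect of the errors argument concrete, here is a small sketch. The byte string raw and the helper decode_strict_or_none are made up for illustration; only bytes.decode and its errors parameter come from the standard library.

```python
# 0xC1 can never start a valid UTF-8 sequence, so it triggers the same
# "invalid start byte" error as in the question.
raw = b"abc\xc1def"  # illustrative sample data, not from the original files


def decode_strict_or_none(data):
    """Return the decoded text, or None if data is not valid UTF-8.

    Hypothetical helper, used here only to show the default strict behaviour.
    """
    try:
        return data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None


ignored = raw.decode("utf-8", errors="ignore")    # bad byte is silently dropped
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
strict = decode_strict_or_none(raw)               # None: strict decoding failed

print(repr(ignored))
print(repr(replaced))
print(strict)
```

Note that errors='ignore' keeps the rest of the text and only discards the offending bytes, so the result can differ from what the author of the file intended.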

Edit:

Here’s an example showing a good way to do this:

for file_name in os.listdir(self.path_to_dir):
    if not file_name.startswith("!"):
        fullpath = os.path.join(self.path_to_dir, file_name)
        # errors='ignore' drops bytes that are not valid UTF-8 instead of raising
        with open(fullpath, 'r', encoding='utf-8', errors='ignore') as body:
            yield [file_name, body.read()]

Using os.path.join, you can eliminate your add_slash method, and ensure that it works cross-platform.
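If the goal is to discard whole files that are not valid UTF-8, rather than keep them with the bad bytes stripped, one possible approach is to read each file as bytes and attempt a strict decode. The following is only a sketch under that assumption; the function name utf8_files_only is hypothetical and not part of the original Corpus class.

```python
import os


def utf8_files_only(path_to_dir):
    """Yield [file_name, text] for files that decode cleanly as UTF-8.

    Hypothetical helper: files containing invalid UTF-8 are skipped entirely.
    """
    for file_name in sorted(os.listdir(path_to_dir)):
        if file_name.startswith("!"):
            continue  # same filter as in the question
        full_path = os.path.join(path_to_dir, file_name)
        if not os.path.isfile(full_path):
            continue  # ignore subdirectories
        with open(full_path, "rb") as body:
            raw = body.read()
        try:
            text = raw.decode("utf-8")  # strict by default
        except UnicodeDecodeError:
            continue  # not valid UTF-8: skip the whole file
        yield [file_name, text]
```

Which variant fits depends on the data: errors='ignore' keeps every file but silently alters some of them, while the strict decode above keeps only files that were valid UTF-8 to begin with.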
