I have a set of data, but I need to work only with utf-8 data, so I need to delete all data with non-utf-8 symbols.
When I try to work with these files, I receive:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3062: character maps to <undefined> and UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 1576: invalid start byte
My code
class Corpus:
def __init__(self,path_to_dir=None):
self.path_to_dir = path_to_dir if path_to_dir else []
def emails_as_string(self):
for file_name in os.listdir(self.path_to_dir):
if not file_name.startswith("!"):
with io.open(self.add_slash(self.path_to_dir)+file_name,'r', encoding ='utf-8') as body:
yield[file_name,body.read()]
def add_slash(self, path):
if path.endswith("/"): return path
return path + "/"
I recive error here yield[file_name,body.read()] and herelist_of_emails = mailsrch.findall(text), but when I work with utf-8 all great.
I suspect you want to use the
errors='ignore'argument onbytes.decode. See http://docs.python.org/3/howto/unicode.html#unicode-howto and http://docs.python.org/3/library/stdtypes.html#bytes.decode .for more info.Edit:
Here’s an example showing a good way to do this:
Using
os.path.join, you can eliminate youradd_slashmethod, and ensure that it works cross-platform.