This is the error that I receive when I try to run tolower() on a character vector from a file that cannot be changed (at least, not manually – too large).
Error in tolower(m) : invalid multibyte string X
The problem seems to be French company names containing the É character, although I have not investigated all of them (that is also not possible to do manually).
It’s strange, because my thought was that encoding issues would have been identified during read.csv(), rather than during operations after the fact.
Is there a quick way to remove these multibyte strings? Or, perhaps a way to identify and convert? Or even just ignore them entirely?
Here’s how I solved my problem:
First, I opened the raw data in a text editor (Geany, in this case), clicked Properties and identified the encoding.
After that, I used the iconv() function. To be more specific, I did this for every column of the data.frame from the imported CSV. It is important to note that I set stringsAsFactors=FALSE in my read.csv() call.
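For anyone who wants to see roughly what that looks like, here is a minimal sketch. The sample data, the column names and the "latin1" source encoding are assumptions for illustration only; substitute whatever encoding your editor reports for your own file.

```r
# Hypothetical sample: company names stored as Latin-1 bytes,
# similar to what read.csv(..., stringsAsFactors = FALSE) might return
df <- data.frame(
  company = c("SOCI\xc9T\xc9 G\xc9N\xc9RALE", "ACME LTD"),
  id = 1:2,
  stringsAsFactors = FALSE
)

# Identify the strings that are not valid UTF-8
bad <- !validUTF8(df$company)

# Assumption: the source encoding is Latin-1; use the one your editor shows
src_enc <- "latin1"

# Convert every character column to UTF-8, leaving other columns untouched
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], iconv, from = src_enc, to = "UTF-8")

# tolower() should no longer complain about invalid multibyte strings
lowered <- tolower(df$company)
```

If you already know the encoding before reading the file, passing it through the fileEncoding argument of read.csv() may save the per-column conversion, though I did not go that route myself.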