I’ve got a bunch of input data, and sometimes it contains garbage characters. For example:
âDots Baby Shower Invitationsâ
Clearly at some point in its past it was "Dots Baby Shower Invitations". But it came to me garbled. I’d be happy to just remove the garbage â characters in cases like this.
But my data set is very large, and simply removing all non-English characters might be somewhat naïve, as in the case of the word naïve itself. I wouldn’t want the ï to be removed, of course.
So is there a potentially automated solution to this problem? Has anyone come before me with this issue? Is this a case of “computers aren’t as smart as humans”?
You could use an English dictionary such as WordNet and modify only the words that cannot be found in it.
For example, naïve contains a “strange” character but is in the dictionary, so it doesn’t get changed. âDots, on the other hand, also contains a strange character but (hopefully) won’t be in the dictionary, so it will be modified and the â will be deleted.
This might be too much effort, but since you said you needed a working solution fast, maybe it’s worth a try… and it will probably work better than a quickly hacked heuristic!
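Here is a rough sketch of the dictionary check in Python, just to illustrate the idea. The small KNOWN_WORDS set is a hypothetical stand-in for a real word list such as WordNet, and the names clean_token and clean_text are made up for this example. It assumes tokens are separated by single spaces and treats every non-ASCII character as "strange".

```python
import re
import string
import unicodedata

# Hypothetical stand-in for a real dictionary (e.g. a word list exported from WordNet).
KNOWN_WORDS = {
    unicodedata.normalize("NFC", w)
    for w in ("naïve", "café", "dots", "baby", "shower", "invitations")
}

# Anything outside the 7-bit ASCII range counts as "strange" here.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def clean_token(token, known_words=KNOWN_WORDS):
    """Strip non-ASCII characters from a token unless it is a known word."""
    if not NON_ASCII.search(token):
        return token  # plain ASCII, nothing to do
    # Look the word up without surrounding punctuation, case-insensitively.
    key = unicodedata.normalize("NFC", token.strip(string.punctuation).lower())
    if key in known_words:
        return token  # legitimate word such as "naïve": leave it alone
    return NON_ASCII.sub("", token)  # unknown word: drop the strange characters


def clean_text(text, known_words=KNOWN_WORDS):
    """Apply clean_token to every space-separated token in the text."""
    return " ".join(clean_token(t, known_words) for t in text.split(" "))


if __name__ == "__main__":
    print(clean_text("âDots Baby Shower Invitationsâ"))
    print(clean_text("That would be naïve, of course."))
```

With a real dictionary you would replace KNOWN_WORDS with the full word list. Note that proper nouns and foreign words that are missing from the dictionary would still lose their accented characters, so it may be worth logging every token that gets modified and spot-checking the results.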