I’ve got a bunch of input data, sometimes I get some garbage characters, example:

Question

Editorial Team

Asked: June 17, 2026 (2026-06-17T09:28:50+00:00)

I’ve got a bunch of input data, sometimes I get some garbage characters, example:

âDots Baby Shower Invitationsâ

Clearly at some point in its past it was "Dots Baby Shower Invitations". But it came to me garbled. I’d be happy to just remove the garbage â characters in cases like this.

But my data set is very large, and just removing all non-English characters would be somewhat naïve, as the word naïve itself shows: I wouldn’t want the ï to be removed, of course.

So is there a potentially automated solution to this problem? Has anyone come before me with this issue? Is this a case of “computers aren’t as smart as humans”?

1 Answer

Editorial Team · Answer 1 · 2026-06-17T09:28:50+00:00

You could use an English dictionary such as WordNet and modify only the words that cannot be found in it.
For example, naïve contains a “strange” character but is in the dictionary, so it doesn’t get changed. âDots, on the other hand, also contains a strange character but (hopefully) won’t be in the dictionary, so it will be modified and the â will be deleted.

This might be too much effort, but as you said you needed a working solution fast maybe it’s worth a try… and it will probably work better than a quickly hacked heuristic!
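A minimal Python sketch of the dictionary idea, under some loud assumptions: `KNOWN_WORDS` is a tiny hypothetical stand-in for a real dictionary (in practice you would load a full word list or query WordNet), and the helper names `clean_token` and `clean_text` are made up for illustration. Tokens are split on whitespace, which is only a rough approximation of real tokenization.

```python
import re

# Hypothetical stand-in for a real dictionary such as WordNet.
# A real run would load a full word list instead of this tiny set.
KNOWN_WORDS = {"naïve", "dots", "baby", "shower", "invitations"}

# A token is any run of non-whitespace characters.
TOKEN = re.compile(r"\S+")


def strip_non_ascii(token):
    """Drop every character outside the 7-bit ASCII range."""
    return "".join(ch for ch in token if ord(ch) < 128)


def clean_token(token, known_words=KNOWN_WORDS):
    """Keep dictionary words as-is; strip non-ASCII from everything else."""
    if token.isascii():
        return token  # nothing suspicious to remove
    # Compare case-insensitively, ignoring common surrounding punctuation.
    core = token.strip(".,;:!?\"'()").lower()
    if core in known_words:
        return token  # e.g. "naïve" is a legitimate word, leave it alone
    return strip_non_ascii(token)


def clean_text(text, known_words=KNOWN_WORDS):
    """Apply clean_token to each token, preserving the original whitespace."""
    return TOKEN.sub(lambda m: clean_token(m.group(0), known_words), text)


print(clean_text("âDots Baby Shower Invitationsâ"))
print(clean_text("a naïve idea"))
```

With this toy word list, the garbled title comes back as plain "Dots Baby Shower Invitations", while "naïve" keeps its ï. Note that this only deletes characters; it does not try to recover what the garbage originally was, and any legitimate accented word missing from the dictionary would be damaged, so the quality of the word list matters a lot.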
