Some of our users use e-mail clients that can’t cope with Unicode, even when the encoding, etc. are properly set in the mail headers.
I’d like to ‘normalise’ the content they’re receiving. The biggest problem we have is users copy’n’pasting content from Microsoft Word into our web application, which then forwards that content by e-mail – including fractions, smart quotes, and all the other extended Unicode characters that Word helpfully inserts for you.
I’m guessing there is no definitive solution for this, but before I sit down and start writing great big lookup tables, is there some built-in method that’ll get me started?
There are basically three phases involved.
First, stripping accents from otherwise-normal letters – the solution to this is here
This paragraph contains “smart quotes” and áccénts and ½ of the problem is fractions
goes to
This paragraph contains “smart quotes” and accents and ½ of the problem is fractions
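For reference, the usual .NET route for this first phase is to decompose the string and drop the combining marks. This is a minimal sketch of that idea, not the linked solution itself; the class and method names (`AccentStripper`, `RemoveDiacritics`) are made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AccentStripper
{
    // Hypothetical helper name; not taken from the original post.
    public static string RemoveDiacritics(string text)
    {
        // FormD splits "á" into "a" followed by a combining acute accent.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Drop the combining marks and keep everything else untouched.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveDiacritics("\u00E1cc\u00E9nts and \u00BD of the problem"));
    }
}
```

Canonical decomposition only touches the accented letters, so the smart quotes and the ½ pass through unchanged, which matches the intermediate result shown above.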
Second, replacing single Unicode characters with their ASCII equivalent, to give:
This paragraph contains "smart quotes" and accents and ½ of the problem is fractions
This is the part where I’m hoping there’s a solution before I implement my own.
Finally, replacing specific characters with a suitable ASCII sequence – ½ to 1/2, and so on – which I’m pretty sure isn’t natively supported by any kind of Unicode magic, but somebody might have written a suitable lookup table I can re-use.
Any ideas?
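One built-in that may cover part of the third phase is compatibility normalisation: Unicode defines a compatibility decomposition for ½ (U+00BD) into the digit 1, U+2044 FRACTION SLASH and the digit 2. The sketch below assumes that is acceptable for your content; `CompatibilityDemo` and `Decompose` are illustrative names only.

```csharp
using System;
using System.Text;

public static class CompatibilityDemo
{
    // Hypothetical helper; shows compatibility decomposition only.
    public static string Decompose(string text)
    {
        // FormKD applies compatibility decompositions, e.g. U+00BD becomes "1", U+2044, "2".
        string decomposed = text.Normalize(NormalizationForm.FormKD);
        // U+2044 FRACTION SLASH is still outside ASCII, so swap it for a plain slash.
        return decomposed.Replace('\u2044', '/');
    }

    public static void Main()
    {
        Console.WriteLine(Decompose("\u00BD of the problem is fractions"));
    }
}
```

This is not a complete answer: FormKD also decomposes accented letters into a base letter plus combining mark (which still need stripping), turns characters such as ² into a plain 2, and leaves smart quotes alone because they have no decomposition. A lookup table is still needed for those.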
Thank you all for some very useful answers. I realize the actual question isn’t “How can I convert ANY Unicode character into its ASCII fallback” – the question is “How can I convert the Unicode characters my customers are complaining about into their ASCII fallbacks”?
In other words – we don’t need a general-purpose solution; we need a solution that’ll work 99% of the time, for English-speaking customers pasting English-language content from Word and other websites into our application. To that end, I analyzed eight years’ worth of messages sent through our system looking for characters that aren’t representable in ASCII encoding, using this test:
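The test itself is not reproduced in this copy of the post. As a rough idea of what such a check could look like (this is an assumption, not the original code), an ASCII encoder configured to throw rather than substitute `?` will tell you whether a string survives the round trip, and a simple scan collects the offending characters. `AsciiCheck`, `IsRepresentableInAscii` and `NonAsciiCharacters` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AsciiCheck
{
    // An ASCII encoding that throws instead of silently substituting '?'.
    private static readonly Encoding StrictAscii = Encoding.GetEncoding(
        "us-ascii", new EncoderExceptionFallback(), new DecoderExceptionFallback());

    // True when every character in the text can be encoded as ASCII.
    public static bool IsRepresentableInAscii(string text)
    {
        try
        {
            StrictAscii.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    // Collects the distinct characters above U+007F across a set of messages.
    public static SortedSet<char> NonAsciiCharacters(IEnumerable<string> messages)
    {
        var found = new SortedSet<char>();
        foreach (string message in messages)
        {
            foreach (char c in message)
            {
                if (c > '\u007F')
                {
                    found.Add(c);
                }
            }
        }
        return found;
    }

    public static void Main()
    {
        var messages = new[] { "plain text", "\u201Csmart quotes\u201D and \u00BD" };
        Console.WriteLine(IsRepresentableInAscii(messages[0]));
        Console.WriteLine(IsRepresentableInAscii(messages[1]));
        Console.WriteLine(NonAsciiCharacters(messages).Count);
    }
}
```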
I’ve then been through the resulting set of unrepresentable characters and manually assigned an appropriate replacement string. The whole lot is bundled up in an extension method, so you can call myString.Asciify() to convert your string into a reasonable ASCII-encoding approximation.
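The shape of such an extension method might look like the sketch below. The handful of table entries are illustrative placeholders, not the table built from the message archive, and falling back to `?` for unmapped characters is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class StringExtensions
{
    // Illustrative entries only; a real table would come from analysing your own content.
    private static readonly Dictionary<char, string> Fallbacks = new Dictionary<char, string>
    {
        ['\u2018'] = "'",    // left single quotation mark
        ['\u2019'] = "'",    // right single quotation mark
        ['\u201C'] = "\"",   // left double quotation mark
        ['\u201D'] = "\"",   // right double quotation mark
        ['\u2013'] = "-",    // en dash
        ['\u00BD'] = "1/2",  // vulgar fraction one half
    };

    public static string Asciify(this string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c <= '\u007F')
            {
                sb.Append(c);               // already ASCII
            }
            else if (Fallbacks.TryGetValue(c, out string replacement))
            {
                sb.Append(replacement);     // known fallback
            }
            else
            {
                sb.Append('?');             // assumption: mark anything unmapped
            }
        }
        return sb.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine("\u201Csmart quotes\u201D and \u00BD of the problem".Asciify());
    }
}
```

Mapping to strings rather than single characters is what lets one input character expand to a sequence such as 1/2.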
Note that there are some rather odd fallbacks in there – like this one:
That’s because one of our users has some program that converts open/close smart-quotes into ² and ³ (like: he said ²hello³), and nobody has ever used them to represent exponentiation, so this will probably work quite nicely for us, but YMMV.