Our programming team currently uses a database with Win1252 (Windows-1252) encoding, and the database is not very good at filtering out bad data natively.
Quite often the end users of our programs simply copy+paste their information from MSWord into our database, which leads to all kinds of funky characters that occasionally can’t be interpreted.
Are there any libraries out there that would parse a string encoded with MSWord’s native encoding and translate it to a similar ASCII, UTF-8 or Win1252 form?
By similar, I mean translating strange double quotes that look something like “ into the typical straight quote ".
Please inform me if my question is vague at all so I can update as necessary.
Ok, it appears that the special characters MSWord inserts automatically (smart quotes, dashes, ellipses) are all part of the Win1252 encoding – so I shouldn’t have too much of a hassle with saving copied+pasted text.
There is always the chance that users will copy+paste from differently encoded sources, so the problem still exists. The best answers I could find on the internet suggest creating an encoding (Encoding ANSI = Encoding.GetEncoding(1252)) and then setting a ‘fallback’ – a replacement for characters that the encoding cannot represent (ANSI.EncoderFallback = new EncoderReplacementFallback(string.Empty);). Note that the instance returned by Encoding.GetEncoding is read-only, so the fallback has to be set on a copy made with Clone() or passed directly to the GetEncoding overload that accepts fallbacks.
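For readers outside .NET, the same fallback idea can be sketched in Python, where the `errors` argument of `str.encode` plays the role of the encoder fallback. This is only an illustration: the helper name `to_cp1252` and its `drop_unmappable` flag are made up for this example. `errors="ignore"` behaves like an empty replacement string, while `errors="replace"` substitutes `?`.

```python
def to_cp1252(text: str, drop_unmappable: bool = True) -> bytes:
    """Encode text as Windows-1252, handling characters it cannot represent.

    Hypothetical helper for illustration only.
    drop_unmappable=True  -> silently drop unmappable characters
    drop_unmappable=False -> replace each unmappable character with b"?"
    """
    errors = "ignore" if drop_unmappable else "replace"
    return text.encode("cp1252", errors=errors)


# Text with an accented letter, smart quotes, and a CJK character
# that Windows-1252 has no byte for.
pasted = "caf\u00e9 \u201cquoted\u201d \u4e2d"

dropped = to_cp1252(pasted)                          # CJK character removed
replaced = to_cp1252(pasted, drop_unmappable=False)  # CJK character becomes "?"
```

Silently dropping characters loses data without any trace, so replacing them with a visible marker may be the safer choice when the pasted text needs to be reviewed later.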
A helpful quote I found in another Stack Overflow question refers to the “0x80 – 0x9F range in which the Windows-1252 code page differs from the ISO-8859-1 code page”, which is apparently the origin of the majority of MSWord conversion problems.
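To make that range concrete, here is a rough Python sketch of repairing text that was decoded as ISO-8859-1 when the bytes were really Windows-1252. In that situation the smart quotes show up as invisible control characters in the U+0080 to U+009F range. The function name `repair_c1_controls` is invented for this example, and it assumes that any character in that range is a misread Windows-1252 byte, which may not hold for every data source.

```python
def repair_c1_controls(text: str) -> str:
    """Map stray U+0080..U+009F characters to their Windows-1252 meaning.

    Hypothetical helper: assumes such characters came from Windows-1252
    bytes that were mistakenly decoded as ISO-8859-1.
    """
    repaired = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                # Reinterpret the code point as a single Windows-1252 byte.
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes in this range are undefined in cp1252;
                # leave those characters unchanged.
                pass
        repaired.append(ch)
    return "".join(repaired)


# Bytes 0x93 and 0x94 are curly double quotes in Windows-1252,
# but control characters when read as ISO-8859-1.
misread = b"\x93Hi\x94".decode("latin-1")
fixed = repair_c1_controls(misread)
```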
For anyone who came to this question and isn’t using a 1252-encoded database (which I hope is the case, as 1252 is terrible): the main problem with MSWord is the ‘smart quotes’ that it automatically substitutes for regular quotes. There are numerous solutions to this problem, which can be found easily by googling ‘smart quotes’.
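One common shape for such a solution is a small translation table that maps Word's typographic characters to plain ASCII. The following Python sketch is only an illustration: the table `SMART_PUNCTUATION` and the function `straighten` are names made up here, and the list of characters is a guess at the usual offenders rather than a complete inventory.

```python
# Hypothetical mapping from typographic characters to ASCII look-alikes.
SMART_PUNCTUATION = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
})


def straighten(text: str) -> str:
    """Replace smart quotes and similar characters with ASCII equivalents."""
    return text.translate(SMART_PUNCTUATION)


sample = "\u201cHello\u201d \u2013 it\u2019s done\u2026"
plain = straighten(sample)
```

Running this kind of normalization before the text reaches the database keeps the stored data within plain ASCII for the characters it covers; anything not in the table passes through unchanged.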
Hope this question/answer helps people with similar tedious problems that Microsoft throws at us.