Possible Duplicate:
Accented characters not correctly imported with BULK INSERT
A .NET program running on my system provides me with a CSV file. I would like to know the encoding of that file.
The CSV file should contain the characters é, ä, å, æ, but they are shown as � (UTF-8 with BOM). Is there any way to get these characters back to the originals, or to their closest English equivalents?
The CSV file is created by a .NET program running on the same machine under the same user, but after the file is created I cannot see the original characters.
Sample data (UTF-8 without BOM) from the CSV file:
Pok�mon Black Version
TGC � Nintendo
on H�tel de R�ve
La Reine Masqu�e et la Tour des Miroirs
If you see � when you decode the file as UTF-8, but you see ï¿½ when you decode it as Windows-1252, then the file literally contains �, i.e. it literally contains the bytes 0xEF 0xBF 0xBD (the UTF-8 encoding of �). Therefore the data is unrecoverable at this point.
This happens when the physical encoding of a byte stream does not match the encoding used to decode it. For instance, the physical encoding is Windows-1252, but a program decodes it into an internal string as UTF-8 with a replacement fallback. The string now internally contains �, but it is not inspected and is written to a file as UTF-8, and the resulting file is what you have.
To avoid the original screw-up, it is a good idea to use an exception fallback instead of a replacement fallback when decoding files (in .NET, for instance, DecoderFallback.ExceptionFallback).
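The sequence described above can be reproduced in a few lines. This is only a sketch in Python rather than .NET: Python's errors="replace" and errors="strict" are used here as stand-ins for the replacement and exception fallbacks, and the sample string is taken from the question's data on the assumption that the lost character was é.

```python
# Illustrative only: errors="replace" / errors="strict" stand in for
# .NET's replacement and exception decoder fallbacks.
original = "Pok\u00e9mon Black Version"  # assumed original text (é = U+00E9)

# Physical encoding: Windows-1252, where é is the single byte 0xE9.
raw = original.encode("cp1252")

# Lossy path: decode as UTF-8 with replacement, then write back out as UTF-8.
lossy_text = raw.decode("utf-8", errors="replace")
lossy_bytes = lossy_text.encode("utf-8")

# The output now literally contains EF BF BD where é used to be.
has_replacement_bytes = b"\xef\xbf\xbd" in lossy_bytes

# Viewing those bytes as Windows-1252 shows the three-character form.
seen_as_cp1252 = lossy_bytes.decode("cp1252")

# Strict path: the mismatch is reported instead of being silently replaced.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Once lossy_bytes has been written, nothing in it records that the replaced byte was 0xE9, which is why the file cannot be repaired after the fact.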
That way you get an exception when the file isn't UTF-8, and you can either try another encoding or let the user know that the file must be in UTF-8.
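The "try another encoding" route might look like the sketch below, again in Python rather than .NET. The function name decode_with_fallback and the candidate list are hypothetical choices made for illustration; they do not come from the original answer.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    data: bytes, encodings: Sequence[str] = ("utf-8", "cp1252")
) -> Tuple[str, str]:
    """Strictly decode data with the first encoding that fits.

    Returns (text, encoding_used). Raises ValueError if no candidate
    decodes the bytes without error, so nothing is silently replaced.
    """
    for name in encodings:
        try:
            return data.decode(name, errors="strict"), name
        except UnicodeDecodeError:
            continue  # this candidate does not match; try the next one
    raise ValueError("input is not valid in any of: " + ", ".join(encodings))


# Example: Windows-1252 bytes are rejected by UTF-8, then accepted by cp1252.
text, used = decode_with_fallback(b"H\xf4tel de R\xeave")
```

UTF-8 is tried first because it rejects most text written in other encodings, whereas Windows-1252 accepts almost any byte sequence. A successful Windows-1252 decode is therefore only a guess, and it may be worth confirming with the user.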