I’ve used MS Word automation to save a .doc as a .htm file. Bullet characters in the .doc are saved correctly to the .htm, but when I read the .htm file into a string (so I can then send it to a database for storage as a string, not a blob), the bullets are converted to question marks or other characters, depending on the encoding used to load the file.
I’m using this to read the text:
string html = File.ReadAllText(myFileSpec);
I’ve also tried using StreamReader, but get the same results (maybe it’s used internally by File.ReadAllText).
I’ve also tried specifying every type of Encoding in the second overload of File.ReadAllText:
string html = File.ReadAllText(originalFile, Encoding.ASCII);
I’ve tried all the encodings exposed as static properties on the Encoding class.
Any ideas?
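On my system (US English), Word saves *.htm files in the Windows-1252 code page. If your system uses that code page too, then that is the encoding you should read the file with. Note that Windows-1252 is not one of the static properties on Encoding, so you have to request it by code page number or name.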
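As a minimal sketch of that suggestion, assuming the file really is in Windows-1252: the class and method names below (WordHtmlReader, ReadWordHtml) are made up for illustration. The RegisterProvider call is only needed on .NET Core / .NET 5 and later, where code page 1252 is not available by default; on the classic .NET Framework, Encoding.GetEncoding(1252) should work without it.

```csharp
using System;
using System.IO;
using System.Text;

public static class WordHtmlReader
{
    // Hypothetical helper: reads a Word-exported .htm file as Windows-1252.
    // Assumption: Word wrote the file in code page 1252 (check the charset
    // in the file's <meta> tag if you are unsure).
    public static string ReadWordHtml(string path)
    {
        // On .NET Core / .NET 5+ the legacy code pages must be registered
        // before GetEncoding(1252) can find them. Registering is harmless
        // if it has already been done.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding windows1252 = Encoding.GetEncoding(1252);
        return File.ReadAllText(path, windows1252);
    }
}
```

In Windows-1252 the bullet is the single byte 0x95, which this decodes to the Unicode bullet character U+2022. Encoding.ASCII cannot represent that byte at all, which would explain the question marks.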
It is also possible that whatever you are using to view the results is creating the question marks for you, so be sure to check for that too.
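One way to rule out the viewer is to inspect the string's code points directly instead of looking at how it is rendered. This is only a sketch; StringInspector and NonAsciiCodePoints are hypothetical names, not part of any library.

```csharp
using System;
using System.Collections.Generic;

public static class StringInspector
{
    // Hypothetical helper: lists every non-ASCII character in the string
    // as a "U+XXXX" code, so the result does not depend on the font or
    // encoding of the console, debugger, or database tool displaying it.
    public static List<string> NonAsciiCodePoints(string text)
    {
        var result = new List<string>();
        foreach (char c in text)
        {
            if (c > 127)
            {
                result.Add("U+" + ((int)c).ToString("X4"));
            }
        }
        return result;
    }
}
```

If the list contains U+2022, the bullet survived the read and the question marks are coming from the display or a later step. If the bullets show up as U+FFFD, or are missing from the list because they became plain '?' characters, the file was decoded with the wrong encoding.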