I’m working on a crawler that opens files, parses them, and puts the content into a database.
However, I’ve had a problem with files that contain odd characters, and I was wondering if there is any simple way to enforce ANSI encoding of the string before I put it into the database, to make sure that there are no illegal characters.
The project is written in C#, and the code I use to put stuff into the database is as follows:
cmd = new OleDbCommand("INSERT INTO TaIndex (IndexId, IndexTekst, IndexDato, IndexModulId, IndexModul, IndexFilsti) VALUES (?, ?, ?, ?, ?, ?);", conn);
cmd.Parameters.Add("IndexId", OleDbType.Integer).Value = newIdGetter();
cmd.Parameters.Add("IndexTekst", OleDbType.LongVarChar).Value = Text;
cmd.Parameters.Add("IndexDato", OleDbType.Date).Value = DateTime;
cmd.Parameters.Add("IndexModulId", OleDbType.VarChar).Value = ModuleId;
cmd.Parameters.Add("IndexModul", OleDbType.VarChar).Value = Module;
cmd.Parameters.Add("IndexFilsti", OleDbType.VarChar).Value = ((object)FilePath) ?? DBNull.Value;
The problem is with the IndexTekst field, whose content comes from the files.
Well, you could always check that the string can be encoded and then re-decoded to the same value:
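A minimal sketch of that check follows. The names `EncodingCheck` and `IsRoundTrippable` are made up for illustration, and the choice of encoding is left to the caller, since the question does not say which code page the database expects.

```csharp
using System.Text;

public static class EncodingCheck
{
    // Hypothetical helper: returns true only if every character in text
    // survives encoding to bytes and decoding back with the given encoding.
    public static bool IsRoundTrippable(string text, Encoding encoding)
    {
        // Characters the encoding cannot represent are swapped for a
        // substitute (such as '?') by the default encoder fallback...
        byte[] bytes = encoding.GetBytes(text);
        string decoded = encoding.GetString(bytes);

        // ...so the decoded string differs from the input whenever
        // something was lost along the way.
        return decoded == text;
    }
}
```

For example, `EncodingCheck.IsRoundTrippable("blåbærsyltetøy", Encoding.GetEncoding("iso-8859-1"))` is true, while the same call with `"10 €"` is false, because ISO-8859-1 has no euro sign. ISO-8859-1 is used here only because it is available on every .NET runtime; on .NET Core and .NET 5+, `Encoding.GetEncoding(1252)` needs `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` to be called first.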
Call that on each text field before saving it – and then consider what to do if it fails…
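As for what to do when the check fails, one option (a sketch, not the only reasonable policy) is to replace every character the target encoding cannot represent with a placeholder before saving. The helper name `ReplaceUnencodable` and the `?` placeholder are assumptions for illustration.

```csharp
using System.Text;

public static class EncodingSanitizer
{
    // Hypothetical helper: returns a copy of text in which every character
    // the named encoding cannot represent has been replaced by '?'.
    public static string ReplaceUnencodable(string text, string encodingName)
    {
        // Explicit replacement fallbacks make the substitution predictable
        // instead of relying on the encoding's default behaviour.
        Encoding encoding = Encoding.GetEncoding(
            encodingName,
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));

        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }
}
```

With this, `EncodingSanitizer.ReplaceUnencodable("10 €", "iso-8859-1")` returns `"10 ?"`. If silently losing characters is not acceptable, passing `EncoderFallback.ExceptionFallback` instead makes `GetBytes` throw an `EncoderFallbackException`, so the crawler can log or skip the file.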
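Is there any way you can change the database schema to accept all Unicode characters? That would be a far more pleasant approach, IMO. If the column can hold Unicode text, the matching change on the C# side would presumably be to bind the parameter as OleDbType.LongVarWChar rather than OleDbType.LongVarChar.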
If you do need to use some sort of ANSI encoding, you should work out exactly which one you mean. Lots of encodings are loosely called “ANSI”, so you need to pin down the specific code page (Windows-1252, for example).