I am using C# and the Microsoft Word 12.0 object library to read data from a .doc file and then save the content to a text file (this is required by my project). My .doc file has some tables, and I need to read each row and column in those tables.
The read operations complete successfully, but the data contains some strange characters (they look like small squares), as in the attached image.

Here is the code I used:
private void btnRead_Click(object sender, EventArgs e)
{
    try
    {
        Microsoft.Office.Interop.Word.ApplicationClass wordObject = new ApplicationClass();
        object file = textBox1.Text; //this is the path
        object nullobject = System.Reflection.Missing.Value;
        Microsoft.Office.Interop.Word.Document docs = wordObject.Documents.Open
            (ref file, ref nullobject, ref nullobject, ref nullobject,
            ref nullobject, ref nullobject, ref nullobject, ref nullobject,
            ref nullobject, ref nullobject, ref nullobject, ref nullobject,
            ref nullobject, ref nullobject, ref nullobject, ref nullobject);
        docs.ActiveWindow.Selection.WholeStory();
        docs.ActiveWindow.Selection.Copy();
        IDataObject data = Clipboard.GetDataObject();
        String allData = "";
        for (int t = 1; t < docs.Tables.Count; t++ )
        {
            Table tbl = docs.Tables[t];
            for (int r = 1; r < tbl.Rows.Count; r++)
            {
                for (int c = 1; c < 3; c++)
                {
                    allData += tbl.Cell(r, c).Range.FormattedText.Text.Trim() + Environment.NewLine;
                }
            }
        }
        txtData.Text = allData;
        saveTextFile(allData);
        docs.Close(ref nullobject, ref nullobject, ref nullobject);
    }
    catch (Exception j)
    {
        MessageBox.Show(j.Message);
    }
}

private void saveTextFile(String data)
{
    try
    {
        StreamWriter sw = new StreamWriter(txtOutput.Text.Trim());
        sw.WriteLine(data);
        sw.Flush();
        sw.Close();
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.StackTrace);
    }
}
Does anyone have any idea how I can remove these strange characters?
Well, I’m not very familiar with the .doc format specifically, but those boxes (the “strange characters”) are generally displayed when the text contains a character that is outside the printable character set, or one the font has no glyph for. In this case, since there are always two of them at the end of a line, they might be related to newline characters in the document (or to some newline-related parsing error), such as \r\n. \r\n is common in Windows-formatted documents, though whether that applies to .doc documents is beyond my expertise.
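One way to test that guess rather than rely on it is to print the code point of every character in a cell's text. The sketch below is self-contained and uses a made-up sample string; CharDump and Describe are hypothetical names, and in the question's code the input would be the string returned by tbl.Cell(r, c).Range.Text.

```csharp
using System;
using System.Text;

static class CharDump
{
    // Hypothetical helper: renders each character of the input
    // as its Unicode code point in U+XXXX form, separated by spaces.
    public static string Describe(string text)
    {
        var sb = new StringBuilder();
        foreach (char ch in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)ch).ToString("X4"));
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Sample input only; substitute the text read from a table cell.
        Console.WriteLine(Describe("Hi\r\n")); // prints "U+0048 U+0069 U+000D U+000A"
    }
}
```

Any code point below U+0020 at the end of the output is a control character, which most text boxes draw as a square.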
Of course, removing them should be relatively trivial if you’re happy to hack it: add a check that deletes the last two characters of every line. It’s not pretty (and I’d probably recommend against it on principle), but it looks like it would work.
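A narrower alternative to blindly dropping the last two characters is to trim only specific trailing control characters. The sketch below assumes the two squares are the carriage return (U+000D) and bell (U+0007) pair that Word's object model places at the end of each table cell's text; that assumption is worth confirming against your own document before relying on it. CellText and Clean are hypothetical names, not part of the Word interop library.

```csharp
using System;

static class CellText
{
    // Hypothetical helper: removes the trailing '\r' and '\a' characters
    // (assumed here to be Word's end-of-cell marker), then trims ordinary
    // whitespace from both ends. Characters inside the text are kept.
    public static string Clean(string raw)
    {
        if (raw == null) return string.Empty;
        return raw.TrimEnd('\r', '\a').Trim();
    }
}

static class Demo
{
    static void Main()
    {
        // Sample input standing in for tbl.Cell(r, c).Range.Text.
        string raw = "Hello\r\a";
        Console.WriteLine(CellText.Clean(raw).Length); // prints 5
    }
}
```

In the question's loop this would take the place of the plain Trim() call. Trim() on its own cannot help when the last character is U+0007, because that character does not count as whitespace and so stops the trimming before the carriage return is reached.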