I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.
Which is the best way to do this:
- VBA macro from inside Word to create CSV and then upload to the DB?
- VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
- Python script via win32com then upload to DB?
The last one is attractive to me as the web interface is being built with Django, but I've never used win32com or tried scripting Word from Python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though: all the text is in tables, and when I pull the strings out of the cells I want, I get a strange little box character at the end of each string. My code looks like:
    sFile = "D:\temp\output.txt"
    fnum = FreeFile
    Open sFile For Output As #fnum

    num_rows = Application.ActiveDocument.Tables(2).Rows.Count

    For n = 1 To num_rows
        Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
        Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
        Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
        If Target = "" Then
            ExportText = ""
        Else
            ExportText = Descr & Chr(44) & Assign & Chr(44) & _
                Target & Chr(13) & Chr(10)
            Print #fnum, ExportText
        End If
    Next n

    Close #fnum
What’s up with the little control character box? Is some kind of character code coming across from Word?
Word has a little end-of-cell marker that it puts at the end of the text in every table cell; that is the box character you are seeing.
It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
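A minimal sketch of that, assuming the marker is the two-character sequence Chr(13) & Chr(7) that Word appends to a cell's Range.Text. The helper name CellText is made up for illustration and is not part of the Word object model:

```vba
' Hypothetical helper: returns a cell's text without the end-of-cell marker.
' Assumption: the marker is Chr(13) & Chr(7), so two characters are dropped.
Function CellText(ByVal c As Cell) As String
    Dim s As String
    s = c.Range.Text
    ' Only trim when the marker is actually present, so the helper
    ' never cuts off real text.
    If Right(s, 2) = Chr(13) & Chr(7) Then
        s = Left(s, Len(s) - 2)
    End If
    CellText = s
End Function
```

In the loop from the question it would be called as Descr = CellText(Application.ActiveDocument.Tables(2).Cell(n, 2)), and likewise for Assign and Target.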
By the way, instead of
Try this:
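One possible reading of this tip, assuming it refers to the indexed For n = 1 To num_rows loop in the question: iterate over the table's Rows collection with For Each rather than counting rows and calling Cell(n, ...) each time. The sketch below only prints the values, and the Sub name is chosen for illustration:

```vba
' Sketch: walk the rows of the second table with For Each
' instead of an index loop. Output goes to the Immediate window.
Sub ListActionItems()
    Dim r As Row
    Dim Descr As String, Assign As String, Target As String

    For Each r In ActiveDocument.Tables(2).Rows
        ' r.Cells(k) is the k-th cell of the current row
        Descr = r.Cells(2).Range.Text
        Assign = r.Cells(3).Range.Text
        Target = r.Cells(4).Range.Text
        Debug.Print Descr, Assign, Target
    Next r
End Sub
```

The strings still carry the end-of-cell marker and need trimming as described above. Also, as far as I know, Word refuses to enumerate individual rows when the table contains vertically merged cells, so this form suits simple grid tables.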