I have a lot of .doc files containing entry specifications for a database. I need to parse all of these documents and create entries from the information in them. I have been trying to use the COM approach. Each file has plain text at the top and bottom of the page, but the specifications are in a table in the center of the page. If I don’t unlink the new .txt file, I can see that the content is transferred to the new document, but it has a bunch of invalid characters in the form of [] running throughout it. When I use file_get_contents(), it completely ignores all of the text from the table.
Is there some way to take care of this programmatically? I can’t really find any information on the API of the word.application COM object. Ideally, I’m thinking I should strip the formatting and then save the file as a .txt file, or something to that effect.
Any help would be greatly appreciated.
Here is my code:
$dir = $PATH."/scripts/specsheets/doc";
$files = scandir($dir);
foreach( $files as $file ) {
    if( strtolower(substr($file, -3)) == "doc" ) {
        $word = new COM("word.application") or die("Unable to instantiate Word");
        $word->Documents->Open($dir."/".$file);
        $new_file = substr($dir."/txt/".$file, 0, -4).".txt";
        $word->Documents[1]->SaveAs($new_file, 2);
        $word->Documents[1]->Close(false);
        $word->Quit();
        $word = NULL;
        unset($word);
        $output = file_get_contents($new_file);
        rename($dir."/".$file, $dir."/archive/".$file);
        echo utf8_encode($output);
    }
}
I couldn’t find a solution using the COM approach, but you can use the antiword program for Windows and run it from PHP to get the output.
The link for the Windows version is:
http://www-stud.rbi.informatik.uni-frankfurt.de/~markus/antiword/
It works very well; it even extracts the data in the tables. It definitely solved my issue.
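For anyone wanting to try it, here is a minimal sketch of how the antiword call mentioned above might look from PHP. It is not the exact command I used. The helper names antiword_command() and doc_to_text() are made up for illustration, and the executable path in the usage comment is an assumption you will need to adjust to wherever antiword is installed. It relies only on antiword printing the extracted text to standard output when given a .doc file.

```php
<?php
// Hypothetical helper: builds the shell command that runs antiword on one file.
// $antiword is the path to the antiword executable (an assumption; adjust it).
function antiword_command(string $antiword, string $docPath): string
{
    // escapeshellarg() quotes each argument so paths with spaces stay intact.
    return escapeshellarg($antiword) . ' ' . escapeshellarg($docPath);
}

// Hypothetical helper: returns the text antiword prints to standard output,
// or null if the command produced no output or could not be run.
function doc_to_text(string $antiword, string $docPath): ?string
{
    $output = shell_exec(antiword_command($antiword, $docPath));
    return is_string($output) ? $output : null;
}

// Example usage inside the loop from the question (paths are assumptions):
// $text = doc_to_text('C:\\antiword\\antiword.exe', $dir . '/' . $file);
// if ($text !== null) {
//     file_put_contents($dir . '/txt/' . substr($file, 0, -4) . '.txt', $text);
// }
```

Going through standard output means there is no intermediate .txt file written by Word, so the result can be parsed directly or saved with file_put_contents() as in the commented usage.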