A customer is asking me to build a module for his running webapp that can load docx files and extract data based on the headings found in the document. I know docx is just a zip file and most of what I need can be found in word/document.xml, though I’m not looking forward to parsing lists, styles, images, tables and whatever else needs to be translated from OOXML to HTML.
Are there any PHP libraries for this format? I do need some flexibility though: a plain OOXML to HTML converter is not going to cut it, because I need to break the document up into parts.
If it’s purely docx, you can try phpdocx… don’t know if it reads or only writes. PHPWord doesn’t yet read, only writes (though I’m working on it).
If you only need the properties information, then you’ll find it all within the /docProps/core.xml file within the zip (and possibly in /docProps/app.xml, depending on exactly which properties you need), so you can bypass most of the files that hold text, styles, images, etc. To verify the file names, [Content_Types].xml lists the core and app properties files under the content types application/vnd.openxmlformats-package.core-properties+xml and application/vnd.openxmlformats-officedocument.extended-properties+xml respectively.
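To make the properties lookup concrete, here is a rough sketch in Python (used only for illustration; in PHP the same steps should map onto ZipArchive and DOMDocument). The function name core_properties is my own, and falling back to docProps/core.xml when [Content_Types].xml has no matching Override entry is an assumption, not something the format guarantees.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of [Content_Types].xml, in ElementTree's {uri}tag notation.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
CORE_TYPE = "application/vnd.openxmlformats-package.core-properties+xml"


def core_properties(docx):
    """Return {local tag name: text} for the core properties part.

    `docx` is a path or a binary file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(docx) as z:
        types = ET.fromstring(z.read("[Content_Types].xml"))
        # Assumption: use the usual location if no Override names the part.
        part = "docProps/core.xml"
        for override in types.iter(CT_NS + "Override"):
            if override.get("ContentType") == CORE_TYPE:
                # PartName is written with a leading slash; zip entries are not.
                part = override.get("PartName").lstrip("/")
        root = ET.fromstring(z.read(part))
    # Strip the namespace from each child tag (dc:title -> "title").
    return {child.tag.split("}")[-1]: (child.text or "") for child in root}
```

Calling core_properties("report.docx") would give a plain dictionary of whatever core properties the file carries, typically things like title and creator.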
EDIT:
If you need headings, then you will need to parse the document, not just the properties. That will mean identifying the heading styles, and then finding the paragraphs in the text that use those styles.
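As a rough illustration of the heading approach (again in Python; ZipArchive plus DOMDocument or XMLReader should do the same job in PHP), the sketch below splits word/document.xml into sections at each heading paragraph. It assumes the heading style ids start with "Heading" (Heading1, Heading2, ...), which is what English-language Word normally writes; localized or custom templates may use other ids, in which case you would have to look them up in word/styles.xml. The name split_by_headings and the prefix parameter are made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_by_headings(docx, prefix="Heading"):
    """Return a list of (heading text, [body paragraph texts]).

    `docx` is a path or a binary file-like object. Paragraphs that come
    before the first heading are dropped in this simplified version.
    """
    with zipfile.ZipFile(docx) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    sections = []
    # iter() also visits paragraphs nested in tables; formatting is ignored.
    for para in root.iter(W + "p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        style = para.find(W + "pPr/" + W + "pStyle")
        style_id = style.get(W + "val", "") if style is not None else ""
        if style_id.startswith(prefix):
            sections.append((text, []))       # start a new section
        elif sections:
            sections[-1][1].append(text)      # body of the current section
    return sections
```

This only recovers plain text per section. Lists, images and tables would still need their own handling, but once the paragraphs are grouped by heading you can convert each group separately.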