I currrently have some code that converts a .doc document to html but the code I am using for converting a .docx to text unfortunately doesn’t get the text and convert it. Below is my code.
private void convertWordDocXtoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
XWPFDocument wordDocument = null;
try {
wordDocument = new XWPFDocument(new FileInputStream(file));
} catch (IOException ex) {
Exceptions.printStackTrace(ex);
}
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(out);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
out.close();
String result = new String(out.toByteArray());
acDocTextArea.setText(newDocText);
String htmlText = result;
}
Any ideas as to why this isn’t working would be much appreciated. The ByteArrayOutput should return the entire html but it is empty and has no text.
Mark, you’re using HWPF package which supports only
.docformat, see this description. The document also mentions attempts to provide the interface for.docxfiles, through XWPF package. However they seem to lack human resources and users are encouraged to submit extensions. Limited functionality should be available though, extracting the text must be one of them.You should also see this question: How to Extract docx (word 2007 above) using apache POI.