I have some code that uses the Java Apache POI library to open a

Asked: June 13, 2026 (2026-06-13T16:00:24+00:00)

I have some code that uses the Java Apache POI library to open a Microsoft Word document and convert it to HTML. It also gets the byte array data of the images in the document. What I still need is a way to get those images into the HTML that I write out to an HTML file. Any hints or suggestions would be appreciated. Please keep in mind that I am a desktop developer, not a web programmer. The code below gets the images.

 private void parseWordText(File file) throws IOException {
      FileInputStream fs = new FileInputStream(file);
      doc = new HWPFDocument(fs);
      PicturesTable picTable = doc.getPicturesTable();
      if (picTable != null) {
           picList = new ArrayList<Picture>(picTable.getAllPictures());
           if (!picList.isEmpty()) {
                for (Picture pic : picList) {
                     // raw image bytes as stored in the document
                     byte[] byteArray = pic.getContent();
                     // the return values below are not used yet
                     pic.suggestFileExtension();
                     pic.suggestFullFileName();
                     pic.suggestPictureType();
                     pic.getStartOffset();
                }
           }
      }
 }

The next method converts the document to HTML. Is there a way to add the byteArray from above to the ByteArrayOutputStream in this code?

private void convertWordDoctoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
    HWPFDocumentCore wordDocument = null;
    try {
        wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream(file));
    } catch (IOException ex) {
        Exceptions.printStackTrace(ex);
    }

    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    wordToHtmlConverter.processDocument(wordDocument);
    org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
    NamedNodeMap node = htmlDocument.getAttributes();


    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult(out);

    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    serializer.setOutputProperty(OutputKeys.METHOD, "html");
    serializer.transform(domSource, streamResult);
    out.close();

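    // decode with the same encoding the serializer was told to write (java.nio.charset.StandardCharsets)
    String result = new String(out.toByteArray(), StandardCharsets.UTF_8);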
    acDocTextArea.setText(newDocText);

    htmlText = result;

}

1 Answer

Editorial Team · Answer 1 · 2026-06-13T16:00:25+00:00

Take a look at the source code for org.apache.poi.hwpf.converter.WordToHtmlConverter at

http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/WordToHtmlConverter.java?view=markup&pathrev=1180740

Its Javadoc states:

This implementation doesn’t create images or links to them. This can be
changed by overriding {@link #processImage(Element, boolean, Picture)} method

If you look at that processImage(...) method in AbstractWordConverter.java (line 790), it appears to call another method named processImageWithoutPicturesManager(...).

http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/AbstractWordConverter.java?view=markup&pathrev=1180740

This method is in turn defined in WordToHtmlConverter, and it looks very much like the place where you want to add your code (line 317):

@Override
protected void processImageWithoutPicturesManager(Element currentBlock,
    boolean inlined, Picture picture)
{
    // no default implementation -- skip
    currentBlock.appendChild(htmlDocumentFacade.document
    .createComment("Image link to '"
    + picture.suggestFullFileName() + "' can be here"));
}

I think this gives you the point where you can start inserting the images into the flow.

Create a subclass of the converter, e.g.

    public class InlineImageWordToHtmlConverter extends WordToHtmlConverter

and then override that method and put your image-handling code in it.

I haven’t tested it, but from reading the source it should be the right approach.
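
To make the override a bit more concrete, here is a minimal sketch of what its body might do: turn the picture bytes into a base64 `data:` URI and append an `img` element to the current block. `InlineImages` and `appendInlineImage` are hypothetical names of my own, not part of POI, and the sketch uses only the JDK so that it runs on its own. It is one possible approach, not something taken from the POI source.

```java
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper (not a POI class): embeds image bytes directly in the HTML DOM.
final class InlineImages {
    private InlineImages() {}

    /**
     * Appends <img src="data:MIME;base64,..."> to the given parent element.
     * content is the raw image data, e.g. what Picture.getContent() returns;
     * mimeType is something like "image/png".
     */
    static Element appendInlineImage(Element parent, byte[] content, String mimeType) {
        String src = "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(content);
        // Create the element in the same DOM document the parent belongs to.
        Element img = parent.getOwnerDocument().createElement("img");
        img.setAttribute("src", src);
        parent.appendChild(img);
        return img;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element body = doc.createElement("body");
        doc.appendChild(body);
        // Three dummy bytes stand in for real picture data.
        Element img = appendInlineImage(body, new byte[] {1, 2, 3}, "image/png");
        System.out.println(img.getAttribute("src")); // prints data:image/png;base64,AQID
    }
}
```

Inside the subclass, the overridden processImageWithoutPicturesManager(Element currentBlock, boolean inlined, Picture picture) could then pass currentBlock and picture.getContent() to this helper. The MIME type needs a source: I am assuming Picture offers getMimeType() in your POI version. If it does not, you could map the result of suggestFileExtension() to a MIME type yourself. You would also construct the subclass instead of WordToHtmlConverter in convertWordDoctoHTML. Keep in mind that embedding images this way makes the HTML file larger; writing each image to its own file and pointing the src attribute at it is the usual alternative.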
