I have a Microsoft Word document that was saved as a .htm web page. Below is the code I have so far. How can I extract the text from the document and append it to a string? I noticed that each paragraph is wrapped in a <p class=MsoNormal> tag, so any suggestions are welcome. The string I want to append the text to is documentText.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

String documentText = "";
// BufferedReader.readLine() replaces the deprecated DataInputStream.readLine();
// try-with-resources closes the file even if reading fails.
try (BufferedReader reader = new BufferedReader(new FileReader(selectedFile))) {
    String line;
    // readLine() returns null at end of file, which is more reliable than available()
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}
Take a look at libraries such as HTML Parser and Jericho HTML Parser, or use the native HTMLEditorKit.Parser + HTMLEditorKit.ParserCallback approach suggested in this answer.
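As a rough illustration of the native approach, here is a minimal sketch that collects the text nodes of an HTML document into a string. The class name HtmlTextExtractor and the method extractText are made up for this example, and it uses ParserDelegator (the JDK's concrete HTMLEditorKit.Parser) with a callback that overrides only handleText. It does not filter by the MsoNormal class; it simply gathers every text chunk the parser reports, so you may need to adapt it for real Word output.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
public class HtmlTextExtractor {

    // Parses the HTML supplied by the reader and returns the text chunks,
    // each followed by a newline.
    public static String extractText(Reader reader) throws IOException {
        final StringBuilder documentText = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called by the parser for each run of text between tags.
                documentText.append(data).append('\n');
            }
        };

        // The third argument (ignoreCharSet = true) tells the parser not to
        // abort when it meets a charset declaration in a <meta> tag.
        new ParserDelegator().parse(reader, callback, true);
        return documentText.toString();
    }

    public static void main(String[] args) throws IOException {
        // Small inline sample standing in for the saved .htm file.
        String html = "<html><body><p class=MsoNormal>Hello</p>"
                + "<p class=MsoNormal>World</p></body></html>";
        System.out.print(extractText(new StringReader(html)));
    }
}
```

To run it on the actual file, you could pass something like new FileReader(selectedFile) to extractText instead of the StringReader, and assign the result to documentText.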