We need to extract a tree-like structure from a given text document using Java. The file type should be common and open (rtf, odt, …). Currently we use Apache Tika to parse plain text from multiple documents.
What file type and API should we use to get the structure parsed most reliably? If this is possible with Tika, I would be happy to see a demonstration.
For example, we should get this kind of data from the given document:
Main Heading
  Heading 1
    Heading 1.1
  Heading 2
    Heading 2.2
Main Heading is the title of the paper. The paper has two main headings, Heading 1 and Heading 2, and each has one subheading. We should also get the content under each heading (paragraph text).
Any help is appreciated.
An OpenDocument (.odt) file is essentially a zip package containing multiple XML files. content.xml holds the actual textual content of the document. The headings we are interested in are stored in text:h elements. Read more about ODT.
I found an implementation for extracting headings from .odt files with QueryPath.
Since the original question was about Java, here is a Java version. First we open content.xml through ZipFile, then we use SAX to parse its XML. The sample code simply prints all the headings:
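    import java.io.File;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLReaderFactory;

    public class OdtDocumentStructureExtractor {

        public void printHeadingsOfOdtFile(File odtFile) {
            // try-with-resources closes the zip archive even if parsing fails
            try (ZipFile zFile = new ZipFile(odtFile)) {
                System.out.println(zFile.getName());
                ZipEntry contentFile = zFile.getEntry("content.xml");
                System.out.println(contentFile.getName());
                System.out.println(contentFile.getSize());
                // XMLReaderFactory is deprecated since Java 9; SAXParserFactory is the newer route
                XMLReader xr = XMLReaderFactory.createXMLReader();
                OdtDocumentContentHandler handler = new OdtDocumentContentHandler();
                xr.setContentHandler(handler);
                xr.parse(new InputSource(zFile.getInputStream(contentFile)));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            new OdtDocumentStructureExtractor().printHeadingsOfOdtFile(new File("Test3.odt"));
        }
    }

The relevant parts of the ContentHandler look like this: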
    private final StringBuilder fullText = new StringBuilder();
    private final StringBuilder heading = new StringBuilder();
    private boolean inHeading;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        if ("text:h".equals(qName)) {
            // Start collecting a new heading; nested elements such as text:span must not reset it
            inHeading = true;
            heading.setLength(0);
            String headingLevel = atts.getValue("text:outline-level");
            if (headingLevel != null) {
                System.out.print(headingLevel + " ");
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // SAX may deliver one text node in several chunks, so append instead of overwriting
        fullText.append(ch, start, length);
        if (inHeading) {
            heading.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("text:h".equals(qName)) {
            System.out.println(heading);
            inHeading = false;
        }
    }
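The handler above only prints headings, while the question also asks for the nesting and the paragraph text under each heading. A minimal sketch of one way to get there is below. It assumes that nesting can be derived from text:outline-level alone and that body text lives in text:p elements. The names OdtOutlineSketch, Node and OutlineHandler are made up for this example, and the inline XML in main is a hand-written stand-in for a real content.xml. Text boxes and other unusual nesting are not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class OdtOutlineSketch {

    // Namespace bound to the "text:" prefix in ODF documents
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    /** One heading plus the paragraphs and subheadings below it (hypothetical type). */
    static class Node {
        final int level;
        final StringBuilder title = new StringBuilder();
        final List<String> paragraphs = new ArrayList<>();
        final List<Node> children = new ArrayList<>();
        Node(int level) { this.level = level; }
    }

    static class OutlineHandler extends DefaultHandler {
        final Node root = new Node(0);                       // synthetic root above all headings
        private final Deque<Node> open = new ArrayDeque<>(); // path from root to current heading
        private StringBuilder target;                        // where character data goes, or null
        private int paragraphDepth;                          // guards against nested text:p

        OutlineHandler() { open.push(root); }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (!TEXT_NS.equals(uri)) return;
            if ("h".equals(localName)) {
                String raw = atts.getValue(TEXT_NS, "outline-level");
                int level = raw == null ? 1 : Math.max(1, Integer.parseInt(raw));
                // Headings at the same or a deeper level end when a new heading starts
                while (open.peek().level >= level) open.pop();
                Node node = new Node(level);
                open.peek().children.add(node);
                open.push(node);
                target = node.title;
            } else if ("p".equals(localName) && paragraphDepth++ == 0) {
                target = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text can arrive in several chunks, so always append
            if (target != null) target.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!TEXT_NS.equals(uri)) return;
            if ("h".equals(localName)) {
                target = null;
            } else if ("p".equals(localName) && --paragraphDepth == 0) {
                // Attach non-empty paragraphs to the innermost open heading
                if (target != null && target.length() > 0) open.peek().paragraphs.add(target.toString());
                target = null;
            }
        }
    }

    static Node parse(InputStream contentXml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // required so uri and localName are populated
        OutlineHandler handler = new OutlineHandler();
        factory.newSAXParser().parse(contentXml, handler);
        return handler.root;
    }

    static void print(Node node, String indent) {
        for (Node child : node.children) {
            System.out.println(indent + child.title + " " + child.paragraphs);
            print(child, indent + "  ");
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<office:text xmlns:office='urn:oasis:names:tc:opendocument:xmlns:office:1.0'"
                + " xmlns:text='" + TEXT_NS + "'>"
                + "<text:h text:outline-level='1'>Heading 1</text:h>"
                + "<text:p>Intro text</text:p>"
                + "<text:h text:outline-level='2'>Heading 1.1</text:h>"
                + "<text:p>Details</text:p>"
                + "<text:h text:outline-level='1'>Heading 2</text:h>"
                + "</office:text>";
        print(parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))), "");
    }
}
```

To run it against a real document, the stream passed to parse would be zFile.getInputStream(contentFile) from the ZipFile code earlier. Note that a document title is often stored as a text:p with a title style rather than as a text:h; if that is the case in your files, it would show up here as a paragraph of the root node rather than as a heading.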