OK, there are many HTML/XML parsers for Java. What I want to do is a bit more than just parsing, though: I want to filter the content and end up with it in a suitable form.
More precisely, I want to keep only the text and images. However, I also want to preserve some of the text formatting, such as italic, bold, alignment, etc.
The reason is that I’m trying to implement a converter from HTML to a specific format that I’ve created myself for my own purposes.
Any ideas? Surely, it must have been done many times before.
OK, I think I found it out: when parsing the Element I can construct a javax.swing.text.html.InlineView, i.e. InlineView iv = new InlineView(element), and then get the attributes as iv.getAttributes(). Right.
If you could help more, i.e. have some first-hand experience to share, please do!
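
For comparison, here is a minimal sketch of a different route than the InlineView idea above: a javax.swing.text.html.HTMLEditorKit.ParserCallback driven by ParserDelegator, which reports tags and text as they are parsed, so no views need to be built. The class name HtmlFilter, the method convert and the bracketed output markers ([b], [i], [p align=...], [img ...]) are made up for illustration and only stand in for whatever the custom target format looks like. It assumes formatting comes from plain tags and the align attribute; CSS-based styling is not handled.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: keeps text, images, bold/italic and paragraph alignment,
// and silently drops every other tag.
class HtmlFilter {

    static String convert(String html) {
        StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.B || tag == HTML.Tag.STRONG) {
                    out.append("[b]");
                } else if (tag == HTML.Tag.I || tag == HTML.Tag.EM) {
                    out.append("[i]");
                } else if (tag == HTML.Tag.P) {
                    // Alignment is read from the align attribute only (assumption).
                    Object align = attrs.getAttribute(HTML.Attribute.ALIGN);
                    out.append(align != null ? "[p align=" + align + "]" : "[p]");
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.B || tag == HTML.Tag.STRONG) {
                    out.append("[/b]");
                } else if (tag == HTML.Tag.I || tag == HTML.Tag.EM) {
                    out.append("[/i]");
                } else if (tag == HTML.Tag.P) {
                    out.append("[/p]");
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Empty elements such as <img> arrive here rather than in handleStartTag.
                if (tag == HTML.Tag.IMG) {
                    out.append("[img ").append(attrs.getAttribute(HTML.Attribute.SRC)).append("]");
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                out.append(data);
            }
        };

        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert(
            "<p align=\"center\">Hello <b>bold</b> and <i>italic</i> <img src=\"a.png\"></p>"));
    }
}
```

The callback only sees the tags as written, so this is a filter over the markup rather than over the resolved styles. If the input relies on stylesheets, going through the document's views, as in the InlineView approach, may still be needed.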