I am new to parsing HTML in Java. I want to get the text between tags, but those tags contain some optional attributes.
For example, I have the following string:
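<tr><td align="center" width="408"><font color="#000000">Hello </font></td><td align="center" width="275"><font color="#0000FF">World! </font></td></tr>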
I want to extract only the text of the second cell, which is “World!” (it has different attributes from “Hello”).
What I have found here so far is:
import java.io.*;
import java.net.URL;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class HtmlParseDemo {
    public static void main(String[] args) throws Exception {
        Reader reader = new StringReader("<tr><td align=\"center\" width=\"408\"><font color=\"#000000\">"
                + "Hello </font></td><td align=\"center\" width=\"275\"><font color=\"#0000FF\">World! "
                + "</font></td></tr>");
        HTMLEditorKit.Parser parser = new ParserDelegator();
        parser.parse(reader, new HTMLTableParser(), true);
        reader.close();
    }
}

class HTMLTableParser extends HTMLEditorKit.ParserCallback {
    private boolean encounteredATableRow = false;

    public void handleText(char[] data, int pos) {
        if (encounteredATableRow) {
            System.out.println(new String(data));
        }
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.TD) {
            encounteredATableRow = true;
        }
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.TD) {
            encounteredATableRow = false;
        }
    }
}
Output:
Hello
World!
It outputs all the text regardless of the attributes.
Any ideas please?
I did it and it worked:
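One way this can be done (a sketch, not necessarily the exact code used) is to inspect the MutableAttributeSet passed to handleStartTag and only start collecting text when the cell's attributes match. The example below assumes the second cell can be identified by width="275"; the class name CellTextByAttribute and the method extract are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not from the original post): collects the text of
// every <td> whose width attribute equals the requested value.
class CellTextByAttribute {

    static String extract(String html, final String width) throws IOException {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean insideTargetCell = false;

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.TD) {
                    // Assumption: the target cell is identified by its width attribute.
                    Object w = a.getAttribute(HTML.Attribute.WIDTH);
                    insideTargetCell = w != null && width.equals(w.toString());
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == HTML.Tag.TD) {
                    insideTargetCell = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Text inside nested tags such as <font> is still reported here.
                if (insideTargetCell) {
                    text.append(data);
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, callback, true);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<tr><td align=\"center\" width=\"408\"><font color=\"#000000\">"
                + "Hello </font></td><td align=\"center\" width=\"275\"><font color=\"#0000FF\">World! "
                + "</font></td></tr>";
        System.out.println(extract(html, "275"));
    }
}
```

The same check could be made against a different attribute (for example HTML.Attribute.COLOR on the font tag) if the width is not a reliable way to tell the cells apart.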