When I run the code below, I get:
[Fatal Error] :1:1: Content is not allowed in prolog.
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
I know the string html contains content that is not allowed, but I would like to suppress all errors.
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.w3c.dom.*;
import org.xml.sax.InputSource;
import javax.xml.xpath.*;
import javax.xml.parsers.*;
public class Test {
    public static void main(String[] args) {
        String html = "---<html><div id='teste'>Teste</div><div id='ola'>Ola tudo ebm!</div></html>";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            String xpathExpression = "//div[@id='ola']";
            InputStream is = new ByteArrayInputStream(html.getBytes());
            InputSource inputSource = new InputSource(is);
            NodeList nodes = (NodeList) xpath.evaluate(
                    xpathExpression, inputSource, XPathConstants.NODESET);
            int j = nodes.getLength();
            for (int i = 0; i < j; i++) {
                System.out.println(nodes.item(i).getTextContent());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
First, XML is not the same as HTML, and XPath works on the XML data model.
To solve this, you will have to find some other way of parsing your input, because the parser invoked when you evaluate the XPath against that string is an XML parser, and XML parsers by definition have no “ignore errors” option. Only well-formed input is accepted; the XML specification itself requires that ill-formed input be treated as a fatal error.
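As a rough illustration of that point, here is a minimal sketch using only the JDK's built-in parser. The class name SilentHandlerDemo and the helper parses are made up for this example, and it parses with a DocumentBuilder directly rather than through xpath.evaluate. Installing an ErrorHandler that does nothing should get rid of the "[Fatal Error]" line on the console, but the parse itself is still expected to fail:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class SilentHandlerDemo {

    // Hypothetical helper: true if the input parsed as XML, false if the parser rejected it.
    static boolean parses(String input) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // A handler that swallows every callback. This only replaces the
            // default handler that prints to the console; it does not make
            // the parser tolerate ill-formed input.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { }
                @Override public void fatalError(SAXParseException e) { }
            });
            builder.parse(new InputSource(new StringReader(input)));
            return true;
        } catch (SAXException e) {
            // The parser still aborts on a fatal error, even with a silent handler.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html = "---<html><div id='teste'>Teste</div><div id='ola'>Ola tudo ebm!</div></html>";
        System.out.println(parses(html));              // the leading "---" is rejected
        System.out.println(parses(html.substring(3))); // the rest happens to be well-formed XML
    }
}
```

The second call only succeeds because this particular string is well-formed XML once the leading "---" is removed; real-world HTML usually is not, which is why a lenient HTML parser is the more general fix.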
So an alternative would be to use a different parser. There are several out there. For example, you could use JTidy. Although it parses HTML into an HTML DOM, with a little bit of glue code you can convert that into something suitable for XPath queries. See Question 3361263, Library to query HTML with XPath in Java.