I am using jsoup (a Java tool for XML files) with the following code to read a URL that is stored in an XML file. Here is my code:
Document d = Jsoup.parse(new File("feed.xml"), null);
Element elementCat = d.getElementsByTag("cat").get(0);
String stringUrl = elementCat.ownText();
System.out.println(stringUrl);
The XML input file looks like this:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<root>
<cat>http://www.isna.ir/ISNA/FullNews.aspx?SrvID=Event&Lang=P</cat>
</root>
My problem is that the program prints this:
http://www.isna.ir/ISNA/FullNews.aspx?SrvID=Event⟪=P
instead of this:
http://www.isna.ir/ISNA/FullNews.aspx?SrvID=Event&Lang=P
In other words, it converts “&Lang” to “⟪” automatically.
Please note that the input is not “&Lang;”; it is just “&Lang” without a semicolon.
I want to disable encoding or escaping and I want the raw data.
How can I solve this problem?
You’ve got a piece of XML. In XML, there’s a way of escaping markup, since sometimes you just need a piece of text containing a literal < or an attribute with a " in its value. Escaping is done using a character entity reference, which starts with an ampersand, followed by a name, followed by a semicolon. Like so: &lt;. That represents <.
Of course, that leaves us with the problem of the ampersand itself. If it’s actually an ampersand you need, rather than the start of some character entity reference, you’ll have to encode it thus: &amp;.
What you’ve got there is XML that isn’t well-formed. The & indicates you’re starting a character entity reference, but then it gets Lang with no terminating semicolon. Now, jsoup doesn’t make much of a problem of this, but that’s because it’s built for HTML parsing and not XML. Since HTML is a bit more lenient than XML, jsoup apparently takes &Lang to be the HTML named character reference &Lang; (U+27EA, ⟪) despite the missing semicolon, and substitutes that character. That matches the output you’re seeing.
So make sure the XML is well-formed, meaning the URL is stored as http://www.isna.ir/ISNA/FullNews.aspx?SrvID=Event&amp;Lang=P. If that can’t be done, don’t treat it as XML but as HTML. If XML processing is what you’re after, look into SAX, StAX, DOM or JAXB.
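To make the "well-formed" point concrete, here is a minimal sketch that uses only the JDK's built-in DOM parser instead of jsoup. The class name BareAmpersandDemo, the helpers escapeBareAmpersands and firstCatText, and the regular expression are all illustrative assumptions, not part of any library. The regex is a heuristic: it treats any & that is not followed by name; or #digits; or #xhex; as a bare ampersand, so it would also rewrite ampersands inside CDATA sections and would leave an undefined reference such as &Lang; untouched.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class BareAmpersandDemo {
    // Hypothetical heuristic: an '&' that does NOT begin something shaped like
    // &name; or &#123; or &#x1F; is assumed to be a literal ampersand.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Rewrites every bare '&' as "&amp;" and leaves existing references alone.
    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    // Parses the XML with the JDK DOM parser and returns the text of the first <cat>.
    // Parse failures are rethrown as IllegalArgumentException to keep callers simple.
    static String firstCatText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("cat").item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String raw = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?>\n"
                + "<root>\n"
                + "<cat>http://www.isna.ir/ISNA/FullNews.aspx?SrvID=Event&Lang=P</cat>\n"
                + "</root>";

        // A conforming XML parser refuses the file as it stands.
        try {
            firstCatText(raw);
            System.out.println("parsed without error");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // Once the bare ampersand is escaped, the parser hands back the raw URL.
        System.out.println(firstCatText(escapeBareAmpersands(raw)));
    }
}
```

With the input from the question, the first call should be rejected by the parser, and the second should print the URL with a plain &Lang=P in it. Fixing whatever produces feed.xml so that it writes &amp; in the first place is the more robust option; patching the text before parsing is only a fallback.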