I’m scraping Wikipedia pages with Java in order to extract information contained within infoboxes.
Everything works fine except for the character encoding.
Wikipedia pages use “UTF-8” encoding.
The Eclipse console on Ubuntu uses “UTF-8” as its default encoding as well.
However, the Eclipse console shows some weird symbols when displaying the scraped information (e.g. Smith Â· Ricardo instead of Smith · Ricardo).
This is the function I use to read the data (it traverses all descendants of a node and joins their text at the end):
private String getTextContent(Node node) {
    String text = "";
    List<Node> children = null;
    if (isTextNode(node)) {
        return node.getNodeValue();
    }
    else if (!node.hasChildNodes()) {
        return "";
    }
    else {
        children = toList(node.getChildNodes());
        for (Node childNode : children) {
            text += getTextContent(childNode);
        }
    }
    return text;
}
I forgot to mention that I’m using the JTidy library for scraping.
The console might be interpreting UTF-8 correctly, but if the data is decoded with the wrong encoding when you read it over the network, the text is already garbled before it ever reaches the console.
Specify UTF-8 as the encoding for JTidy to use.
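To illustrate the point, here is a minimal, standard-library-only sketch of what goes wrong and how an explicit encoding fixes it. The class name EncodingDemo and the helper readAll are made up for this example, and the byte array only stands in for the bytes received from Wikipedia. With JTidy itself, the equivalent step should be a call such as tidy.setInputEncoding("UTF-8") before parsing; that is from memory of the JTidy API, so check it against the version you are using.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original code or of JTidy.
class EncodingDemo {

    // Reads the whole stream and decodes its bytes with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the UTF-8 bytes sent by the server ("Smith", middle dot, "Ricardo").
        byte[] page = "Smith \u00B7 Ricardo".getBytes(StandardCharsets.UTF_8);

        // Wrong charset: the two-byte middle dot is decoded as two separate characters.
        String wrong = readAll(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1);

        // Matching charset: the middle dot is decoded as a single character.
        String right = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);

        // Print lengths rather than the text so the result does not depend on the console.
        System.out.println("ISO-8859-1 length: " + wrong.length());
        System.out.println("UTF-8 length: " + right.length());
    }
}
```

The string decoded as ISO-8859-1 comes out one character longer than the one decoded as UTF-8, which is exactly the stray Â in front of the dot. Once the text has been decoded incorrectly, nothing in getTextContent or in the console settings can repair it, so the encoding has to be set at the point where the bytes are read.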