I’m scraping Wikipedia pages with Java in order to extract information contained within infoboxes.

Question

Editorial Team

Asked: June 5, 2026 (2026-06-05T04:09:07+00:00)

I’m scraping Wikipedia pages with Java in order to extract information contained within infoboxes.

Everything works fine except for the character encoding.
Wikipedia pages use UTF-8 encoding.

The Eclipse console on Ubuntu also uses UTF-8 as its default encoding.
However, the Eclipse console shows some odd symbols when displaying the scraped information (e.g. Smith Â· Ricardo instead of Smith · Ricardo).

This is the function I use to read the data (it traverses all descendants of a node and joins their text content at the end):

private String getTextContent(Node node) {
    // A text node contributes its own value.
    if (isTextNode(node)) {
        return node.getNodeValue();
    }
    // A non-text node without children contributes nothing.
    if (!node.hasChildNodes()) {
        return "";
    }
    // Otherwise concatenate the text of all descendants, in document order.
    StringBuilder text = new StringBuilder();
    for (Node childNode : toList(node.getChildNodes())) {
        text.append(getTextContent(childNode));
    }
    return text.toString();
}

I forgot to mention that I’m using the JTidy library for scraping.

1 Answer

Editorial Team · Answer 1 · 2026-06-05T04:09:09+00:00

The console may well be interpreting UTF-8 correctly, but if the data is decoded with the wrong encoding when it is read over the network, the text is already corrupted by the time it is printed.
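To see why the symptom points at the read side, here is a minimal sketch using only the standard library. It assumes the page bytes were decoded as ISO-8859-1 somewhere along the way, which is one common way to end up with exactly this output; `MojibakeDemo` and `roundTrip` are hypothetical names made up for the illustration. The middle dot (U+00B7) is two bytes in UTF-8, `C2 B7`, and a single-byte decoder turns them into two separate characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hypothetical helper: encode the text as UTF-8 bytes, then decode
    // those bytes with the given charset.
    static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodeWith);
    }

    public static void main(String[] args) {
        // \u00B7 is the middle dot; escapes keep this file encoding-independent.
        String original = "Smith \u00B7 Ricardo";

        // Assumed failure mode: UTF-8 bytes decoded as ISO-8859-1.
        String wrong = roundTrip(original, StandardCharsets.ISO_8859_1);
        // Correct decoding with the charset the bytes were written in.
        String right = roundTrip(original, StandardCharsets.UTF_8);

        // \u00C2 is the stray capital A with circumflex seen in the console.
        System.out.println(wrong.equals("Smith \u00C2\u00B7 Ricardo")); // true
        System.out.println(right.equals(original));                     // true
    }
}
```

Once the stray character is in the string, no console setting can remove it, so the fix has to happen where the bytes are decoded.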

Specify UTF-8 as the encoding for JTidy to use.
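As a rough sketch of what "specifying the encoding" means in plain Java, the example below decodes a byte stream explicitly as UTF-8 instead of relying on a default. `Utf8Read` and `readAll` are hypothetical names, and the `ByteArrayInputStream` is only a stand-in for the network stream. With JTidy itself the equivalent step is to set the input encoding on the `Tidy` instance before parsing; in the releases I am aware of that is `setInputEncoding("UTF-8")`, but treat that as an assumption and check it against the version you use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8Read {
    // Hypothetical helper: decode the whole stream as UTF-8, independent of
    // the platform default charset.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a Wikipedia page read over the network.
        byte[] page = "Smith \u00B7 Ricardo".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(page));
        System.out.println(text.equals("Smith \u00B7 Ricardo")); // true
    }
}
```

The point of the design is that the charset is named at the place where bytes become characters, so the result does not change with the machine's locale settings.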
