I am trying to download and parse a web page that contains Chinese characters. I'm not sure whether I should decode it as "utf-8" or as something else; nothing I have tried so far displays the text correctly. I am using getUrlContent() from the Wiktionary sample code.
public void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.main);
    mText = (TextView) findViewById(R.id.textview1);
    huaren.prepareUserAgent(this);
    String test = "fail";
    try {
        test = getUrlContent("http://huaren.us/");
    } catch (ApiException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    byte[] b = new byte[100000];
    try {
        b = test.getBytes("utf-8");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    char[] charArr = (new String(b)).toCharArray();
    CharSequence seq = java.nio.CharBuffer.wrap(charArr);
    // Show at most the first 1000 characters; a shorter result would otherwise throw
    mText.setText(charArr, 0, Math.min(charArr.length, 1000));//.setText(seq);
}
protected static synchronized String getUrlContent(String url) throws ApiException {
    if (sUserAgent == null) {
        throw new ApiException("User-Agent string must be prepared");
    }
    // Create client and set our specific user-agent string
    HttpClient client = new DefaultHttpClient();
    HttpGet request = new HttpGet(url);
    request.setHeader("User-Agent", sUserAgent);
    try {
        HttpResponse response = client.execute(request);
        // Check if server response is valid
        StatusLine status = response.getStatusLine();
        if (status.getStatusCode() != HTTP_STATUS_OK) {
            throw new ApiException("Invalid response from server: " +
                    status.toString());
        }
        // Pull content stream from response
        HttpEntity entity = response.getEntity();
        InputStream inputStream = entity.getContent();
        ByteArrayOutputStream content = new ByteArrayOutputStream();
        // Read response into a buffered stream
        int readBytes = 0;
        while ((readBytes = inputStream.read(sBuffer)) != -1) {
            content.write(sBuffer, 0, readBytes);
        }
        // Return result from buffered stream
        return new String(content.toByteArray(), "utf-8");
    } catch (IOException e) {
        throw new ApiException("Problem communicating with API", e);
    }
}
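The charset is declared in the page itself, so hard-coding "utf-8" when you decode the response will garble the text whenever the page declares a different encoding.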
In general, there are three ways to specify the encoding of an HTML page served over HTTP:
the Content-Type header of the HTTP response
the encoding pseudo-attribute in the XML declaration
a meta tag inside the head element
See Character Encodings for details.
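As a rough illustration of the three declaration sites listed above, here is a minimal sketch in plain Java that pulls the charset name out of each one. The class name CharsetDeclarations, its method names, and the regular expressions are assumptions made for this example, not part of the sample code; real-world markup can be messier than these patterns allow.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts a charset name from each declaration site.
final class CharsetDeclarations {
    // "charset=..." as found in a Content-Type header value or inside a meta tag
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    // encoding="..." pseudo-attribute of an XML declaration
    private static final Pattern XML_ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)[\"']",
            Pattern.CASE_INSENSITIVE);
    // Any meta tag; each one is then searched for a charset parameter
    private static final Pattern META_TAG = Pattern.compile(
            "<meta[^>]*>", Pattern.CASE_INSENSITIVE);

    // 1. Content-Type header, e.g. "text/html; charset=gb2312"
    static String fromContentTypeHeader(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(headerValue);
        return m.find() ? m.group(1) : null;
    }

    // 2. XML declaration, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
    static String fromXmlDeclaration(String markup) {
        if (markup == null) {
            return null;
        }
        Matcher m = XML_ENCODING.matcher(markup);
        return m.find() ? m.group(1) : null;
    }

    // 3. meta tag, e.g. <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    static String fromMetaTag(String markup) {
        if (markup == null) {
            return null;
        }
        Matcher tags = META_TAG.matcher(markup);
        while (tags.find()) {
            Matcher m = CHARSET_PARAM.matcher(tags.group());
            if (m.find()) {
                return m.group(1);
            }
        }
        return null; // no meta tag declares a charset
    }
}
```

Each method returns null when its declaration is absent, so the caller can fall through to the next site.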
So you should evaluate each possible declaration in order to find the appropriate encoding. You could start by parsing the page as utf-8 and restart with the declared charset if you encounter a Content-Type meta tag that names a different one.
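The "parse as utf-8, then restart" idea could look roughly like the sketch below. It is only an illustration under a few assumptions: the class PageDecoder and its decode method are hypothetical names, the charset from the HTTP header (if any) is passed in by the caller and takes precedence, only the meta tag is inspected in the page body, and StandardCharsets requires Java 7 or later (API level 19 on Android).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes raw page bytes using the declared charset.
final class PageDecoder {
    // Matches charset=... inside a meta tag (both the http-equiv and the short form)
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    static String decode(byte[] body, String headerCharset) {
        // Use the charset from the Content-Type header when it is usable
        Charset fromHeader = lookup(headerCharset);
        if (fromHeader != null) {
            return new String(body, fromHeader);
        }
        // Provisional pass: the ASCII markup, including the meta tag,
        // stays readable even if the real encoding is something else
        String provisional = new String(body, StandardCharsets.UTF_8);
        Matcher m = META_CHARSET.matcher(provisional);
        if (m.find()) {
            Charset declared = lookup(m.group(1));
            if (declared != null && !declared.equals(StandardCharsets.UTF_8)) {
                // Restart: decode the same bytes again with the declared charset
                return new String(body, declared);
            }
        }
        return provisional;
    }

    // Returns the named charset, or null if the name is missing, illegal, or unsupported
    private static Charset lookup(String name) {
        if (name == null) {
            return null;
        }
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : null;
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

In getUrlContent(), the final new String(content.toByteArray(), "utf-8") could then be replaced by a call such as PageDecoder.decode(content.toByteArray(), headerCharset), where headerCharset is whatever charset name you extracted from the response's Content-Type header, or null if there was none.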