I am using this servlet to fetch HTML content from another domain and include it in my own page via Ajax. It sets the response charset to UTF-8:
public class ProxyServlet extends HttpServlet {
    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException {
        String urlString = request.getParameter("url");
        try {
            URL url = new URL(urlString);
            url.openConnection();
            BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
            response.setContentType("text/html; charset=UTF-8");
            PrintWriter out = new PrintWriter(new OutputStreamWriter(response.getOutputStream(), "UTF8"), true);
            char[] buf = new char[4 * 1024];
            int len;
            while ((len = reader.read(buf, 0, buf.length)) != -1) {
                out.write(buf, 0, len);
            }
            out.flush();
        }
        catch (MalformedURLException e) {
            throw new ServletException(e);
        }
        catch (IOException e) {
            throw new ServletException(e);
        }
    }
}
The document I am extracting has a meta tag like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
I copied and pasted it onto my own page so it matches exactly. According to the browser page info it is definitely using “UTF-8” encoding. Yet I am still getting “” instead of “ ” in the extracted html contents.
The bad characters are actually present in the responseText returned by this ProxyServlet. I thought explicitly setting the response content type and the output stream charset would handle this, but I must be missing something. Has anyone resolved this before?
Instead of converting a byte stream to chars and back again, you could just copy from one byte stream to another via a byte[] buffer. Or use a Spring util:
or explicitly:
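The snippet that belonged here did not survive, so what follows is only a minimal sketch of such a byte-for-byte copy, not the original code. The class name `ByteCopySketch` and the helper `copyBytes` are hypothetical names chosen for illustration; the in-memory streams in `main` stand in for `url.openStream()` and `response.getOutputStream()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class; the names are not from the original answer.
class ByteCopySketch {

    // Copies every byte from in to out through a byte[] buffer.
    // No Reader or Writer is involved, so no charset decoding takes place.
    static long copyBytes(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[4 * 1024];
        long total = 0;
        int len;
        while ((len = in.read(buf)) != -1) {
            out.write(buf, 0, len);
            total += len;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 bytes containing non-ASCII characters (e-acute and a non-breaking space).
        byte[] source = "caf\u00E9\u00A0ok".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copyBytes(new ByteArrayInputStream(source), sink);
        System.out.println(copied + " bytes copied, identical: "
                + Arrays.equals(source, sink.toByteArray()));
    }
}
```

In the servlet from the question, the equivalent call would be roughly `copyBytes(url.openStream(), response.getOutputStream())` after `response.setContentType(...)`, assuming the remote page really is served as UTF-8.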
This guarantees that the data is copied as is, with no chance of corrupting it by decoding the bytes with the wrong charset.
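To illustrate how a wrong charset can corrupt the data, here is a small sketch. One plausible cause in the question's servlet is that `new InputStreamReader(url.openStream())` is created without a charset, so it decodes with the JVM default, which is not necessarily UTF-8. The sketch assumes ISO-8859-1 as a stand-in for that default; the class name `CharsetRoundTripDemo` and the method `throughReader` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; ISO-8859-1 is only an assumed stand-in for a non-UTF-8 default charset.
class CharsetRoundTripDemo {

    // Simulates the servlet's path: UTF-8 bytes arrive from upstream, are decoded
    // with readerCharset, re-encoded as UTF-8, and finally decoded as UTF-8 by the browser.
    static String throughReader(String text, Charset readerCharset) {
        byte[] upstream = text.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(upstream, readerCharset);
        byte[] downstream = decoded.getBytes(StandardCharsets.UTF_8);
        return new String(downstream, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "a\u00A0b"; // 'a', non-breaking space, 'b'

        String wrong = throughReader(original, StandardCharsets.ISO_8859_1);
        String right = throughReader(original, StandardCharsets.UTF_8);

        // With the wrong reader charset an extra U+00C2 character appears before the space.
        System.out.println("wrong charset: length " + wrong.length()
                + ", unchanged: " + wrong.equals(original));
        System.out.println("right charset: length " + right.length()
                + ", unchanged: " + right.equals(original));
    }
}
```

If decoding to chars is really needed, passing the charset explicitly to the reader, as in `new InputStreamReader(url.openStream(), "UTF-8")`, avoids depending on the default. Copying raw bytes sidesteps the issue entirely.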