We are building a (Java) web project with Eclipse. By default Eclipse uses Cp1252 encoding on Windows machines (which we use).
As we also have developers in China (in addition to Europe), I started to wonder if that is really the encoding to use.
My initial thought was to convert to UTF-8, because “it supports all the character sets”. However, is this really wise? Should we pick some other encoding instead? I see a couple of issues:
1) How do web browsers interpret the files by default? Does it depend on which language version one is using? What I am after here is whether we should explicitly declare the encoding schemes used:
- XHTML files can set the encoding explicitly using <?xml version='1.0' encoding='UTF-8' ?> declarations.
- CSS files can set this by @CHARSET "UTF-8";.
- JavaScript files do not have in-file declarations, but one can globally define <meta http-equiv="Content-Script-Type" content="text/javascript; charset=utf-8"> or <script type="text/javascript" charset="utf-8"> for specific scripts.
What if we leave a CSS file without the @CHARSET "UTF-8"; declaration? How does the browser decide how it is encoded?
2) Is it wise to use UTF-8, given that it is so flexible? By locking our code into Cp1252 (or maybe ISO-8859-1) I can ensure that foreign developers don’t introduce special characters into files. This effectively prevents them from inserting Chinese comments, for example (we should use 100% English). Also, allowing UTF-8 can let developers accidentally introduce strange characters that are difficult or impossible to perceive with the human eye. This occurs when people, for example, copy-paste text or happen to press some weird keyboard combination accidentally.
It would seem that allowing UTF-8 in the project just brings problems…
3) For internationalization, I initially considered UTF-8 a good thing (“how can you add translations if the file encoding doesn’t support the characters one needs?”). However, as it turned out, Java Resource Bundles (.properties files) must be encoded with ISO-8859-1, because otherwise they might break. Instead, the international characters are converted into \uXXXX notation, for example \u0009, and the files are encoded with ISO-8859-1. So… we are not even able to use UTF-8 for this.
For binary files… well, the encoding scheme doesn’t really matter (I suppose one can say it doesn’t even exist).
How should we approach these issues?
Go for it. You want world domination.
It uses the Content-Type response header for this (note, the real response header, not the HTML meta tag). I see/know that you’re a Java developer, so here are JSP/Servlet targeted answers: setting <%@page pageEncoding="UTF-8" %> at the top of a JSP page will implicitly do this right, and setting response.setCharacterEncoding("UTF-8") in a Servlet/Filter does the same. If this header is absent, then it is entirely up to the browser to decide/determine the encoding. MSIE will simply use the platform default encoding. Firefox is a bit smarter and will guess the encoding based on page content.
I would just write up a document describing team coding conventions and spread this among developers. Every self-respecting developer knows that s/he risks getting fired for not adhering to it.
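To make the "browser has to guess" problem concrete, here is a minimal sketch using only the JDK. It is not servlet code: the class and method names (EncodingMismatchDemo, roundTrip) are made up for illustration, and ISO-8859-1 merely stands in for whatever platform default a client might fall back to. It shows that bytes written as UTF-8 only come back intact when the reader uses the same charset, which is what declaring the charset in the Content-Type header guarantees.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {

    // Encodes text with one charset and decodes the resulting bytes with
    // another, mimicking a client that was not told the real encoding.
    static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // "café", written with an escape to keep this file ASCII-only

        // Writer and reader agree on UTF-8: the text survives.
        String correct = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        // Reader wrongly assumes ISO-8859-1: the two UTF-8 bytes of "é"
        // are decoded as two separate characters.
        String garbled = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(correct.equals(original));            // true
        System.out.println(garbled.equals("caf\u00C3\u00A9"));   // true
    }
}
```

The same reasoning applies to CSS and JavaScript responses: if the server states the charset, nothing is left to guesswork.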
This has been solved since Java 1.6 with the new Properties#load() method taking a Reader and the new ResourceBundle.Control class, wherein you can control the loading of the bundle file. In JSP/Servlet terms, usually a ResourceBundle is used. Just set the message bundle name to the fully qualified class name of the custom ResourceBundle implementation and it will be used.
The encoding is indeed only interesting whenever one wants to convert computer-readable binary data to human-readable character data. For "real" binary content it indeed doesn’t make any sense, since the binary format doesn’t represent any sensible character data.
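As a rough sketch of the Reader-based Properties#load() mentioned above, the following compares it with the classic InputStream variant, which always decodes as ISO-8859-1 and therefore needs \uXXXX escapes. The class name, method names and the "greeting" key are invented for illustration, the file content is simulated with in-memory byte arrays, and the code uses Java 8 conveniences (StandardCharsets, UncheckedIOException) rather than strict Java 1.6 syntax.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class Utf8PropertiesDemo {

    // Decodes the bytes as UTF-8 via a Reader before handing them to Properties.
    static Properties loadUtf8(byte[] content) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(content), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Classic variant: load(InputStream) always assumes ISO-8859-1.
    static Properties loadLatin1(byte[] content) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String expected = "\u4F60\u597D"; // two Chinese characters ("ni hao")

        // A properties "file" containing the raw characters, stored as UTF-8.
        byte[] utf8File = ("greeting=" + expected).getBytes(StandardCharsets.UTF_8);
        // The same entry in escaped form, as an ISO-8859-1 file would hold it.
        byte[] escapedFile = "greeting=\\u4F60\\u597D".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(loadUtf8(utf8File).getProperty("greeting").equals(expected));      // true
        System.out.println(loadLatin1(escapedFile).getProperty("greeting").equals(expected)); // true
        System.out.println(loadLatin1(utf8File).getProperty("greeting").equals(expected));    // false
    }
}
```

A custom ResourceBundle.Control would do essentially the same thing as loadUtf8 inside its bundle-loading method, with the stream coming from the class loader instead of a byte array.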
See also:
ResourceBundle.Control example