Friday, February 02, 2007

UTF-8 ... and why you should use it.

I've recently discovered that, for the sake of internationalization and compatibility with various languages, UTF-8 should be used for everything. That might sound like a very vanilla statement with nothing to back it up, but I really don't have time to go into the reasons I've discovered for using UTF-8. Suffice it to say that it has now been forced on me as a requirement, one which I don't mind adhering to, but converting takes some effort. A case in point is using Hibernate as your persistence framework. If you're using XDoclet2 to generate all your mapping files from annotations, the conversion is very simple. In the component definitions in your Ant task, make the following changes:
<component classname="org.xdoclet.plugin.hibernate.HibernateMappingPlugin" destdir="${basedir}/src" version="3.0">

and:
<component destdir="${src.dir}" classname="org.xdoclet.plugin.hibernate.HibernateConfigPlugin" jdbcdriver="${jdbc.driver}" jdbcpassword="${jdbc.password}" jdbcurl="${jdbc.url}" jdbcusername="${jdbc.username}" dialect="${hibernate.dialect}" cacheprovider="${hibernate.cache.provider_class}" cacheusequerycache="${hibernate.cache.use_query_cache}" jdbcpool="" jdbcisolation="${hibernate.connection.isolation}" showsql="${hibernate.show_sql}" version="${hibernate.version}" style=""/>

... and that's all there is to getting your Hibernate files generated with UTF-8 encoding. However, if you're making a web app, that's not the only thing you're going to have to change. All your JSP(X) files should start with the following:

<jsp:root xmlns:jsp="http://java.sun.com/JSP/Page" version="2.0" xmlns:c="http://java.sun.com/jsp/jstl/core" xmlns:fmt="http://java.sun.com/jsp/jstl/fmt" xmlns:spring="http://www.springframework.org/tags" xmlns:display="urn:jsptld:http://displaytag.sf.net" xmlns:authz="http://acegisecurity.org/authz">

<jsp:directive.page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8"/>

<jsp:output omit-xml-declaration="false" doctype-root-element="html" doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN" doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>

... or something very similar, the key point being that every encoding specified is UTF-8.
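To see why every layer has to agree on the encoding, here is a minimal Java sketch. It is not part of my actual setup: the class name EncodingDemo and the transcode helper are made up purely for illustration. It writes a string out as bytes in one charset and reads it back in another, which is roughly what happens when a page is saved as UTF-8 but served or stored as something else.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from any real project.
class EncodingDemo {

    /** Encodes text with one charset, then decodes the same bytes with another. */
    static String transcode(String text, Charset writeAs, Charset readAs) {
        byte[] bytes = text.getBytes(writeAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the final e
        String original = "caf\u00e9";

        // Same charset on both sides: the text survives the round trip.
        String ok = transcode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(ok.equals(original)); // prints true

        // Written as UTF-8 but read as ISO-8859-1: the accented letter
        // turns into two unrelated characters.
        String broken = transcode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(broken.equals(original)); // prints false

        // ISO-8859-1 cannot represent this CJK character at all, so
        // getBytes substitutes a question mark and the data is lost.
        String lost = transcode("\u65e5", StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        System.out.println(lost.equals("?")); // prints true
    }
}
```

The last case is the one that matters for internationalization: a single-byte charset silently throws away anything outside its repertoire, while UTF-8 can encode every Unicode character.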
Also, since a web application most likely uses a database, you'll need to update your database settings. One of the most common databases out there (and the one that I use) is MySQL. To permanently change MySQL's settings to use the UTF-8 character set, you'll have to find your appropriate 'my.cnf' file and put this line:
default-character-set=utf8

...under the [client] and [mysqld] headings.
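Put together, the relevant part of 'my.cnf' would look something like the fragment below. This is only a sketch that assumes nothing else is configured under those headings; your file will almost certainly contain other settings, which should stay where they are.

```ini
[client]
default-character-set=utf8

[mysqld]
default-character-set=utf8
```

One caveat that goes beyond my own setup: as far as I know, MySQL 5.5 and later no longer accept default-character-set under [mysqld] and expect character-set-server=utf8 there instead, so check the documentation for the server version you are running.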

1 comment:

Alex Marshall said...

As an afterthought, I realized that Blogger doesn't escape the XML / HTML characters that you put in. So if you actually want to see the changes this post is supposed to show, you'll have to view the page source in your browser, or otherwise download the page and view it.