05 August 2005

Character development

Representing characters for global software

When we used to write in C, we rarely concerned ourselves with the character set we were working with. In general you could write a string constant in the usual way - "Hello, world" - and the compiler would convert it to an array of encoded bytes using the character set of the underlying platform. Because most computers in the western world use ASCII or a compatible derivative, this works just fine most of the time for those of us in ASCII-speaking regions such as Britain and the US.

Things start to go wrong when you start to get a little more adventurous. For example, to check that a character is an upper-case alphabetic you might do something like this:

  if ((c >= 'A') && (c <= 'Z'))
  {
      ...
  }


This will work fine on a PC and many other systems, but what if you try to run this on an IBM AS/400 system? That system didn't use ASCII - it used a different standard called EBCDIC instead (and may still do, if there are any of these machines still around today). EBCDIC doesn't place the letters A to Z in a single contiguous block, so on such a machine the range check also matches code points that fall in the gaps: characters such as '}' and 'ü' will pass the test in that example code as if they were upper-case alphabetic characters.

Ok, that's a contrived example. But what about writing code that has to work in German, with umlauts over some of the vowels? Or Danish, whose alphabet has 29 letters? Or Korean or Arabic, which don't use the Latin alphabet at all?

Java handles this whole problem by working internally with Unicode, no matter what the underlying platform uses. Unicode is a character set designed to be able to represent every character in use anywhere in the world. The problem from a developer's perspective is how to get those Unicode characters into and out of your applications, because most systems don't yet use Unicode as the native format for storing files or displaying text on screen.
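To see what that means in practice, here's a minimal sketch (not from the original example, just an illustration): a Java char is always a Unicode (UTF-16) value, so classification through the Character class gives the same answer on every platform, and it understands letters well beyond ASCII.

  char plain  = 'A';
  char umlaut = 'Ü';  // U+00DC, an upper-case letter outside ASCII
  char brace  = '}';

  // Classification is by Unicode category, not by raw byte values,
  // so there is no EBCDIC-style surprise lurking here.
  System.out.println(Character.isUpperCase(plain));   // true
  System.out.println(Character.isUpperCase(umlaut));  // true
  System.out.println(Character.isUpperCase(brace));   // false
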

Mostly you don't have to worry about this. The InputStreamReader and OutputStreamWriter classes, for example, convert characters between Unicode and the platform's default character set without you needing to take explicit action.
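For example, reading a text file with the default conversion is just a matter of wrapping the byte stream in a reader. A sketch, with a made-up file name, the usual java.io imports and error handling omitted:

  // No charset specified: bytes are decoded using the platform's
  // default character set, whatever that happens to be.
  BufferedReader in = new BufferedReader(
          new InputStreamReader(new FileInputStream("greeting.txt")));
  String line = in.readLine();
  in.close();
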

But in today's networked world, many applications aren't designed to handle just one system - you may be developing client/server applications. What if your server is running on a system where the platform character set is Latin-9, and the client is on a Latin-1 system? The Unicode in the client application will be converted to Latin-1 and sent to the server, but the server will interpret the bytes as Latin-9. Since the mappings between Latin-1 and Latin-9 don't quite line up, some characters from the client will be wrong by the time they get into the server as Unicode.
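You can reproduce that corruption without any network at all by encoding with one character set and decoding with the other. A rough sketch (checked exception handling omitted; '¼' sits at one of the eight code points where Latin-1 and Latin-9 disagree):

  // What the client sends, and what the server thinks it received.
  String fromClient = "total: 100 ¼";
  byte[] onTheWire = fromClient.getBytes("ISO-8859-1");          // Latin-1: ¼ is byte 0xBC
  String asSeenByServer = new String(onTheWire, "ISO-8859-15");  // Latin-9: 0xBC is Œ
  System.out.println(asSeenByServer);                            // prints "total: 100 Œ"
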

As it happens this is quite easy to fix. InputStreamReader and OutputStreamWriter both have constructors that let you specify the encoding you want to use. In this case you could set the encoding explicitly to Latin-9 by putting something like this on both the client and the server:

  String charset = "ISO-8859-15"; // a.k.a. Latin-9
  reader = new BufferedReader(
          new InputStreamReader(socket.getInputStream(), charset));
  writer = new BufferedWriter(
          new OutputStreamWriter(socket.getOutputStream(), charset));


Now it doesn't matter what the default encodings of the client and server platforms are; your applications will communicate using Latin-9 regardless.

Should you use Latin-9? Probably not. For one thing, the Java specification defines a minimum set of encodings that every Java implementation must support, and Latin-9 isn't one of them. A better choice is probably UTF-8: it is mandated by the specification, so every Java runtime must support it, and better still it can encode every Unicode character, so your application could handle Russian, Greek and even Arabic and Japanese text.
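As a rough illustration of the difference, here's a sketch (exception handling again omitted) comparing the two on text that Latin-9 simply can't represent - getBytes substitutes '?' for any character the target charset can't encode:

  String mixed = "Ω こんにちは";  // Greek and Japanese characters

  // UTF-8 can encode any Unicode character, so this round trip is lossless.
  String viaUtf8 = new String(mixed.getBytes("UTF-8"), "UTF-8");
  System.out.println(mixed.equals(viaUtf8));  // true

  // Latin-9 has no Greek or Japanese characters, so they come back as '?'.
  String viaLatin9 = new String(mixed.getBytes("ISO-8859-15"), "ISO-8859-15");
  System.out.println(viaLatin9);  // prints "? ?????"
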
