Appendix B. About Unicode, UCS-2, and UTF-8

Table of Contents
ASCII: What everyone knows
ISO 8859: What everyone would like to forget
Unicode: East meets West
Unicode's Pluses and Minuses
Unicode Transformation Format: UTF-8
Unicode and FreeTDS

For better or worse, FreeTDS brings the otherwise innocent programmer into contact with the arcane business of how data are stored and transported. FreeTDS is a data communications library that, of course, connects to databases, which are charged with storing information in a way that is neutral to all architectures and languages. On the surface, that might not seem very complex, or even worth discussing. Under the surface, things are not so simple.

ASCII: What everyone knows

The world we are all familiar with, programmingwise, is ASCII. Our email (mostly), our "text" files, our web pages (mostly), all use ASCII to represent English (or English-like) text. Perhaps because ASCII [1] has been a standard since the 1960s in the United States, and was adopted internationally by the ISO in 1972, it seems like the "natural" way to store information. But let's look under the hood a little bit, and examine our assumptions.

Our so-called "text" files are nothing special, nothing but a little agreement we enter into with our operating system. The only reason we can "read" them with cat or vi is that the operating system and its tools are in on the agreement. A file is only a stream of bytes, after all, no more "text" than an executable. The only thing distinguishing a "text" file from any other is our understanding to treat it like one. We agree that the number 65 will represent the letter A, 66 the letter B, and so on, 128 values in all (0 through 127). See man ascii for further details.
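To make the agreement concrete, here is a minimal sketch in Python (chosen only for illustration; FreeTDS itself is not involved). The names text, raw, codes, decoded, and is_ascii are made up for this example.

```python
# A "text" string becomes a sequence of small numbers once it is encoded.
text = "ABC"
raw = text.encode("ascii")   # the bytes as they would sit in a file
codes = list(raw)            # the same bytes viewed as plain integers: [65, 66, 67]

# The agreement works in both directions: numbers back to letters.
decoded = bytes([72, 105]).decode("ascii")   # 72 is 'H', 105 is 'i'


def is_ascii(data: bytes) -> bool:
    """Return True if every byte lies in the 7-bit ASCII range, 0 through 127."""
    return all(b < 128 for b in data)
```

Nothing in raw says "I am text". A tool such as cat simply applies the ASCII agreement to whatever bytes it is handed, and is_ascii shows how little there is to check: only that no byte exceeds 127.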

The important thing to understand is that the designation of 65 for A and so on is a choice. It's an encoding standard, made necessary by the simple old fact that computers store numbers, not letters. ASCII is so ubiquitous these days that it's sometimes hard to remember there was a time when it was but one of a set of competing encoding standards. Others you probably have heard of include EBCDIC and the Baudot systems, but they are by no means the only historical alternatives, nor the only modern ones.
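That the numbering is a choice is easy to see by comparing two encodings. The sketch below assumes Python's standard cp037 codec, which implements one EBCDIC variant (IBM code page 037); other EBCDIC code pages differ in some positions. The variable names are invented for the example.

```python
letter = "A"

# The same letter gets a different number under each agreement.
ascii_code = letter.encode("ascii")[0]    # 65  (0x41)
ebcdic_code = letter.encode("cp037")[0]   # 193 (0xC1)

# The same bytes read differently depending on which agreement is applied.
raw = bytes([0xC1, 0xC2, 0xC3])
as_ebcdic = raw.decode("cp037")           # 'ABC' under EBCDIC code page 037

# Under ASCII these bytes are not text at all: each exceeds 127.
try:
    raw.decode("ascii")
    ascii_error = False
except UnicodeDecodeError:
    ascii_error = True
```

The bytes never change; only the interpretation does. A program that assumes the wrong agreement either fails outright, as the ASCII decode does here, or quietly produces the wrong letters.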

The ASCII Compact

UNIX® and unix-like systems bought into ASCII big time. Program code, filenames, string constants (and variables), configuration files, everything but everything is encoded in ASCII. Practically every utility, command, and library assumes the "text" data will be ASCII. At the dawn of the 21st century, there is widespread recognition that ASCII will no longer suffice, but the art of upgrading all the computers and computer programmers is, well, an unfinished work.

Notes

[1] czyborra.com is offline at the time of this writing (December 2003). It contained good information, so it's still included here, in case it comes back to life.