A matter of XML character
By Ed Tittel
The XML specification says that XML can use the ISO 10646 Universal Multiple-Octet Coded Character Set, a.k.a. the Universal Character Set (UCS) and also known as Unicode, so I thought it might be interesting to cover some of the terminology and usage issues that working with Unicode in XML documents can entail. Please bear with me; we're about to dive into a large bowl of alphabet soup full of acronyms of all kinds!
XML must deal with the following two forms of Unicode text encoding, both based on a technique called the UCS Transformation Format (UTF):
- UTF-16: The default way to encode Unicode characters is a 16-bit encoding. Using this technique, most characters are assigned a unique 16-bit value, called a character code. Unicode 16-bit encoding is the same as the ISO/IEC 10646 UTF-16 transformation format. When using UTF-16, characters with code values from 0 to 65,535 are encoded as single 16-bit values; characters with code values of 65,536 or greater are encoded as pairs of 16-bit values called surrogates. (Basically, surrogates extend the space available to UTF-16 up to code value 1,114,111, or hexadecimal 10FFFF, while the full UCS code space is 31 bits wide; this is currently believed to be sufficient for capturing all the world's known alphabets, glyphs, and ideograms.) Using 4-byte codes, or extended UTF-16 surrogates, requires a 4-byte Universal Character Set (UCS) encoding called UCS-4. I mention this only because if you want to use it, DTD extensions written for the WebSGML Adaptations to ISO Standard 8879 (which define the ISO-Latin-1 through ISO-Latin-12 character sets) must be incorporated, so that the DTDs can legally contain numeric character codes big enough to represent 4-byte encodings.
- UTF-8: This technique provides a variable-length, byte-oriented way to encode character data, designed specifically for compatibility with ASCII-based computing systems. Essentially, UTF-8 preserves the single-byte ASCII encodings for all character codes that are 7 bits in length or less.
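To make the surrogate arithmetic concrete, here is a minimal Python sketch. The helper name to_surrogates and the sample characters are my own choices for illustration, not anything the XML or Unicode specifications prescribe; the sketch assumes the usual UTF-16 layout in which the 20 bits left after subtracting 65,536 are split into two 10-bit halves.

```python
def to_surrogates(code_point):
    """Split a code point above 0xFFFF into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper written for this illustration.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 0x10000..0x10FFFF need surrogates")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


# A character below 65,536 occupies a single 16-bit unit.
a_ring = "\u00e5"
bmp_units = len(a_ring.encode("utf-16-be")) // 2

# U+1D11E (MUSICAL SYMBOL G CLEF) lies above 65,535 and needs a pair.
clef = "\U0001d11e"
clef_units = len(clef.encode("utf-16-be")) // 2
pair = to_surrogates(ord(clef))

print(bmp_units, clef_units, [hex(unit) for unit in pair])
```

Python's built-in utf-16-be codec is used only as a cross-check: the two 16-bit units it produces for the clef should match the pair computed by hand.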
Integrating Unicode characters outside the ASCII character set boundary of 0 to 127 requires that such characters not only be encoded into a sequence of two to four bytes, but also that the values in those bytes be managed to properly convey the underlying data in an encoded form. (For the details on the translation algorithm used, consult pg. 47 of The Unicode Standard, Version 3.0, by the Unicode Consortium, Addison-Wesley.)
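As a rough illustration of how the bytes are "managed," the following Python sketch packs a code value into UTF-8 by hand. It follows the commonly documented UTF-8 bit layout (a lead byte that signals the sequence length, followed by continuation bytes carrying 6 bits each); I have not reproduced the book's own presentation, and the function name utf8_encode is purely hypothetical.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (illustrative helper)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not encoded on their own")
    if code_point < 0x80:
        # ASCII range: one byte, value unchanged.
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Sample code points chosen for this sketch: A, a-ring, euro sign, G clef.
samples = [0x41, 0xE5, 0x20AC, 0x1D11E]
for cp in samples:
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```

The four samples take one, two, three, and four bytes respectively, and each result can be compared against Python's own chr(cp).encode("utf-8").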
Most XML, XHTML, or HTML documents that invoke Unicode-based encoding schemes use the UTF-8 transformation format by default (this is the assumed encoding scheme if no explicit alternate encoding scheme is included in a document's XML declaration). It's important to note that UTF-8 is incompatible with so-called "higher-order" ASCII characters (those with character codes from 128 to 255): in UTF-8, each of these characters takes two bytes rather than the single byte used in ISO-Latin-1. Fortunately, you can still use the same character entities you may have learned while using HTML and ISO-Latin-1, provided the document's DTD declares them. You should also become accustomed to using Unicode character codes for character entities, which you can create as &#229; or as &#x00E5; to produce the lowercase a with a ring above it (the ISO-Latin-1 character entities for this are &aring; and &#229;).
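If you'd like to check that the decimal and hexadecimal forms really name the same character, here is a small Python sketch using the standard library's xml.etree.ElementTree and html modules. The tiny <p> documents are made up for the example, and the behavior shown for &aring; assumes a document with no DTD that declares it.

```python
import html
import xml.etree.ElementTree as ET

# Decimal and hexadecimal numeric character references for U+00E5.
decimal_text = ET.fromstring("<p>&#229;</p>").text
hex_text = ET.fromstring("<p>&#x00E5;</p>").text

# The named entity is part of HTML's entity set, so html.unescape knows it.
named_text = html.unescape("&aring;")

# In plain XML with no DTD, the same name is undefined and parsing fails.
try:
    ET.fromstring("<p>&aring;</p>")
    named_ok_in_bare_xml = True
except ET.ParseError:
    named_ok_in_bare_xml = False

print(decimal_text == hex_text == named_text == "\u00e5")
print(named_ok_in_bare_xml)
```

The numeric forms work in any XML document, whereas the named form depends on an entity declaration being available, which is one reason to get comfortable with the numeric codes.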
For more great information on this topic, please visit one or more of the following online resources:
- The Unicode Consortium operates an extremely informative Web site at http://www.unicode.org/. You can find access to all kinds of specifications, technical information, and character set displays here.
- Dave Johnson at Boston University has posted an incredibly dense but informative resource called the "ISO 10646 Dictionary" wherein he defines all kinds of related terms, acronyms, and specifications. Check it out at http://cns-web.bu.edu/pub/djohnson/web_files/i18n/ISO-10646.html.
- Back in 1997, Rick Jelliffe created a DTD that defines named character entities for SGML or XML documents that use ISO-10646 character encodings. This is a useful external document to include in your work should you wish to take advantage of these definitions. You'll find the DTD at http://www.oasis-open.org/cover/xml-ISOents.txt
Although this may seem entirely anti-climactic, it's all this capability that lies behind the simple statement in an XML declaration that might read:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
Now you know what lurks behind the encoding attribute value and can better appreciate what depths of representation it can deliver!
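For a quick look at what the encoding value changes on disk, the Python sketch below serializes one made-up document twice, once as UTF-8 and once as UTF-16, and parses both with xml.etree.ElementTree. The element name and the template variable are assumptions of mine; the point is only that the bytes differ while the parsed character data does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document; {enc} is filled in per encoding.
template = ('<?xml version="1.0" encoding="{enc}" standalone="yes"?>'
            '<name>\u00e5</name>')

utf8_bytes = template.format(enc="UTF-8").encode("utf-8")
# Python's "utf-16" codec prepends a byte-order mark.
utf16_bytes = template.format(enc="UTF-16").encode("utf-16")

# The parser reads the declared encoding and decodes the bytes accordingly.
utf8_name = ET.fromstring(utf8_bytes).text
utf16_name = ET.fromstring(utf16_bytes).text

print(len(utf8_bytes), len(utf16_bytes))
print(utf8_name == utf16_name)
```

For text that is mostly ASCII markup, the UTF-16 form comes out roughly twice as large, which helps explain why UTF-8 is such a common choice.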
Ed Tittel is a principal at LANWrights, Inc., a wholly owned subsidiary of LeapIt.com. LANWrights offers training, writing, and consulting services on Internet, networking, and Web topics (including XML and XHTML), plus various IT certifications (Microsoft, Sun/Java, and Prosoft/CIW).