Character Matters

Track: Core Technologies

Audience Level: Technical View

Time: Tuesday, November 16 at 11:45

Author: Diederik Gerth van Wijk, Content Architect, Kluwer

Keywords: DSDL, Unicode, SGML, Publishing

Abstract:

Documents are made of characters; XML documents are made of Unicode characters. Compared with SGML, we now have potentially one million characters, whereas SGML provides only a hundred. On the other hand, we have lost the option of defining our own SDATA entities. This presents us with two challenges.

The first challenge is validation. How can we check that a document, an element, or an attribute contains only those characters that we know how to process: how to render, sort, search for, hyphenate, capitalise, pronounce... How can we tell a typesetter which character set he has to find a font for? XML Schema provides a simple way of restricting the valid characters in an attribute or a simple element with a regular expression. Such an expression can use some of the Unicode character properties, such as the block a character is defined in (for example Basic Latin or Latin Extended-B) or its General Category (for example Uppercase Letter or Math Symbol). However, this cannot be used in mixed content, which is typical of text markup.
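The kind of check described above can be sketched outside XML Schema. The following Python fragment is only an illustration under stated assumptions: the two allowed blocks, the function names and the sample document are hypothetical, and the block boundaries are written out as code point ranges because Python's standard `re` module has no block or category escapes. It gathers all character data of an element, including the text between child elements, and reports the characters that fall outside the allowed blocks.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Assumed repertoire for illustration: two Unicode blocks as code point ranges.
ALLOWED_BLOCKS = {
    "Basic Latin": range(0x0000, 0x0080),
    "Latin Extended-B": range(0x0180, 0x0250),
}


def in_allowed_block(ch):
    """True if the character lies in one of the allowed blocks."""
    return any(ord(ch) in block for block in ALLOWED_BLOCKS.values())


def general_category(ch):
    """Two-letter Unicode General Category, e.g. 'Lu' or 'Sm'."""
    return unicodedata.category(ch)


def disallowed_chars(xml_source):
    """List the characters in the character data that are outside the repertoire.

    itertext() walks the text of the element and of all its descendants,
    so mixed content is covered; attribute values are not inspected here.
    """
    root = ET.fromstring(xml_source)
    text = "".join(root.itertext())
    return [ch for ch in text if not in_allowed_block(ch)]


# Hypothetical sample: U+01A0 is in Latin Extended-B, U+20AC is not in either block.
sample = "<p>Hello <b>w\u01a0rld</b> \u20ac</p>"
print(disallowed_chars(sample))
```

A General Category restriction could be added in the same way, by testing `general_category(ch)` against a set of permitted category codes.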

The second challenge is that Unicode cannot provide every character one needs. New characters will be invented, and since proof of their existence is needed before they can be added to Unicode, one needs a way of encoding them in the meantime. Some characters will by definition never be added: Unicode does not distinguish between a diaeresis and an umlaut, yet the difference is relevant for processing. The way to encode such characters in Unicode is to use the Private Use Areas, but how does a document tell what its private character U+E000 means? It is the first private use code point, and I am sure that in my documents I will use it for something different than you will in yours.

Part 7 of DSDL, the Document Schema Definition Languages, will provide a way of defining a character repertoire that can be used to validate the characters in mixed content as well. To be manageable, reusable and extensible, these character repertoires will be identifiable, named sets, and one should be able to add one's own private characters to an existing public set. If one defines a character for j with acute and declares it a member of the set of Dutch characters, it will be allowed in every context that allows that set. Beyond that, one should be able to define how to sort, render and search for that character, and to redefine these rules depending on the context, such as an xml:lang attribute.
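As a rough illustration of a named, extensible repertoire, the idea can be modelled in a few lines of Python. This is a sketch under assumptions, not the DSDL Part 7 syntax: the class names, the fields and the choice of printable ASCII as the base "Dutch" set are all hypothetical. It shows a private use character being added to an existing set together with a rendering fallback and a sort rule.

```python
from dataclasses import dataclass, field

# Private Use Area of the Basic Multilingual Plane: U+E000..U+F8FF.
PRIVATE_USE = range(0xE000, 0xF900)


@dataclass
class PrivateChar:
    code_point: int
    description: str
    render_as: str  # fallback rendering in standard Unicode characters
    sort_as: str    # string that takes the character's place when sorting


@dataclass
class Repertoire:
    name: str
    members: set = field(default_factory=set)   # allowed code points
    private: dict = field(default_factory=dict)  # code point -> PrivateChar

    def add_private(self, pc):
        """Add a private use character to this named set."""
        if pc.code_point not in PRIVATE_USE:
            raise ValueError("not a private use code point")
        self.private[pc.code_point] = pc
        self.members.add(pc.code_point)

    def allows(self, text):
        """True if every character of the text belongs to the set."""
        return all(ord(ch) in self.members for ch in text)

    def sort_key(self, text):
        """Replace private characters by their declared sort strings."""
        return "".join(
            self.private[ord(ch)].sort_as if ord(ch) in self.private else ch
            for ch in text
        )


# Hypothetical base set: printable ASCII standing in for "Dutch characters".
dutch = Repertoire("Dutch", set(range(0x20, 0x7F)))
dutch.add_private(PrivateChar(0xE000, "j with acute", "j\u0301", "j"))

print(dutch.allows("bi\ue000na"))
print(dutch.sort_key("bi\ue000na"))
```

Context-dependent rules, for example per xml:lang value, could be modelled by keying `sort_as` on a language tag. That is left out here to keep the sketch short.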

The presentation ends with a plea for a Bottom Up Constraint Language, and for including in that definition all the information on how to process a character or an element. BUCL UP!