Abstract
As the backbone for world-wide document and data encoding, transmission, and processing, XML must be usable quickly and easily all around the world. Good internationalization is easily taken for granted, but bad or missing internationalization becomes extremely annoying. The paper discusses the success of XML internationalization, the internationalization topics currently being worked on, and the challenges for the future.
In today's global business environment, it is critical that information technology is usable quickly and easily all around the world. Good internationalization is most often taken for granted and often goes unnoticed, but bad or missing internationalization is extremely annoying. This is of particular importance for a technology such as XML, which is the backbone for world-wide document and data encoding, transmission, and processing.
Traditionally, products and standards were localized, i.e. adapted to the particular needs of a language or region. Over time, it was realized that a fair amount of effort could be shared among different localizations. This led to internationalization, i.e. preparing a product for various localizations in advance. Still, in many cases, internationalization was an afterthought for products and standards. Successful companies and standards organizations, however, have learned that internationalizing standards and products from the start quickly pays off, because:
It avoids expensive and clumsy retrofitting at a later stage.
It avoids or reduces delays in product rollouts world-wide.
It prevents piecemeal approaches in different regions that lead to interoperability and maintenance nightmares.
Because of the 'internationalization-readiness' of basic technology infrastructure (including, of course, XML), internationalizing from the start is becoming easier and less expensive.
This paper discusses the work on internationalization of XML and related technologies with a somewhat chronological view: What has been done right in the past, and why is it done much better in XML than in other technologies? What chances have been missed, if any? What is currently being worked on and will be available soon? What are the problems we still have to deal with in the long term? Understanding these issues will help the audience to fully leverage the benefits of various XML technologies in a world-wide context.
The Internationalization Activity of the W3C (see [Activity]) is coordinating internationalization efforts at W3C. It currently has a working group and an interest group. The working group consists of three task forces. The Core Task Force is working together with other working groups, mostly through reviews, to assure adequate internationalization of W3C specifications. It is also responsible for the Character Model [CharacterModel] and the IRI specification [IRIdraft]. The Guidelines, Education & Outreach (GEO) Task Force works on guidelines and techniques for using internationalization, such as [AuthoringTech]. The Web Services Task Force concentrates on internationalization aspects of Web services, and is working on a use case document [WSUsage]. New participants in either the working group or the interest group are welcome.
The most basic topic that any kind of software internationalization has to address is character encoding. For XML as a text-based format, this was particularly important. Based on earlier work for SGML and HTML, XML adopted a model that can be summarized very concisely as: "identify the character encoding; think in Unicode". The principles of this approach are now being formalized in the W3C Character Model [CharacterModel]. This was an enormous step forward from the previous situation, in which technology could only work in a single local encoding at a time. The availability of the Universal Character Set (UCS) (i.e. Unicode/ISO 10646) as a common reference was a crucial precondition for this step. The step itself was both necessary and highly successful in assuring world-wide interoperability and the ability to combine and process data from many different sources.
It is the author's hope that in the long term, we will in many areas and technologies again converge on a single (but this time UCS-based and therefore universally usable) encoding. For external data interchange, the best candidate is UTF-8. At the time XML was created, it was too early for such a step, but XML made a step in the right direction by requiring every XML processor to accept UTF-8 (and UTF-16), and by using UTF-8 as the default encoding for unmarked files (while UTF-16 is required to be marked by a Byte Order Mark (BOM)).
XML also has a clearly defined way for documents to self-identify their character encoding, based on internal labeling and a simple bootstrapping procedure. The principle of using a single character encoding per external entity makes sure that it is easy to treat XML with widely available generic text tools. In encodings with a restricted character repertoire, a Numeric Character Reference (NCR) can be used to represent a character outside the repertoire in element and attribute content [1]. With a certain regularity, there are requests for better support or a larger repertoire of (named) character entities, rather than just numeric character references. However, as operating systems provide more and more language-specific keyboards as well as generic user-friendly input methods, which make it possible to input the actual character rather than a placeholder, this should become less and less necessary.
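As a minimal sketch of how NCRs bridge a restricted repertoire, the following uses Python's standard `xmlcharrefreplace` error handler. The sample text and the `<p>` element are illustrative assumptions, not taken from any specification.

```python
import xml.etree.ElementTree as ET

text = "Dürst pays 5 €"  # illustrative sample text

# ISO-8859-1 can encode "ü" directly but has no euro sign; the
# "xmlcharrefreplace" handler substitutes a decimal NCR for it.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(latin1_bytes)  # b'D\xfcrst pays 5 &#8364;'

# US-ASCII has neither character, so both become NCRs.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)   # b'D&#252;rst pays 5 &#8364;'

# The document labels its own encoding; the parser resolves the NCR and
# hands the application the same Unicode text in both cases.
document = (b'<?xml version="1.0" encoding="iso-8859-1"?><p>'
            + latin1_bytes + b"</p>")
parsed = ET.fromstring(document).text
print(parsed == text)  # True
```

Note that this only works for element and attribute content; as footnote [1] points out, names, PIs, and comments cannot contain NCRs.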
One way to understand the huge advantages that XML provides for internationalization in the area of character encoding is to contrast it with how some other formats handle character encoding. Email (MIME) headers are a particularly unfortunate example. They mangle non-ASCII characters in many different ways, allowing various encodings to be mixed in the same header line, not giving any guarantees for any encodings to be understood by recipients, using both quoted-printable and base64 for binary encoding, and adding to that punycode for the encoding of internationalized domain names, and (possibly in some variant) for the internationalization of the left-hand parts of email addresses. For MIME, in the case of US-ASCII, any kind of processing is very easy, and the full gamut of text processing tools can be used. For non-ASCII characters, even the simplest operations such as search/replace are hopelessly difficult. For XML, all the data is textual, and can be processed with equal ease with generic tools.
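The contrast can be illustrated with a small sketch using Python's standard `email.header` module. The name and the `<from>` element are illustrative assumptions.

```python
from email.header import Header, decode_header, make_header

name = "Martin Dürst"  # illustrative header value

# A MIME header carries the name as an ASCII-only "encoded word".
raw_header = Header(name, charset="utf-8").encode()
print(raw_header)

# A generic text search on the raw header cannot find the name ...
found_in_raw = "Dürst" in raw_header
print(found_in_raw)  # False

# ... until the header has been decoded with MIME-aware code.
decoded = str(make_header(decode_header(raw_header)))
print(decoded == name)  # True

# In an XML document the text is present as text, so the same generic
# search works without any format-specific decoding step.
xml_document = f"<from>{name}</from>"
print("Dürst" in xml_document)  # True
```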
Text normalization, very different from character encoding, is a rather advanced topic where we have made progress in the past few years, but where a lot of work still needs to be done.
The original publication of XML contained the following as a definition:
match (Of strings or names:) Two strings or names being compared must be identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings. At user option, processors may normalize such characters to some canonical form. No case folding is performed.
This turned out to be unclear, for the following reasons:
What canonical form should be used?
Is the optional normalization carried out before or after parsing?
Does this make sense if we cannot rely on it being available in implementations, or on it being used in processing?
It was quickly realized that a more careful and coordinated approach was needed. This led to a requirements document on string identity matching and string indexing ([CharReq]), which emphasized the need for early uniform normalization and a well-defined (pre)composed normalization form. Normalization Form C (NFC) was then defined in [UAX#15].
The idea of early uniform normalization is simple: By having everybody use the normalization form that almost everybody is already using anyway, repeated normalization operations for equality tests are avoided. By making it the responsibility of the originator of some text to normalize, different behavior in differing implementations, which can lead to security problems, is avoided. By designing a normalization form that is as close as possible to already established practice (NFC), the effort for conversion is minimized. Unfortunately, there was a countereffect to this: Because most texts are already normalized in most cases, problems are rarely noticeable, and therefore the motivation for addressing text normalization is often too low.
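The following sketch shows the matching problem and the effect of early normalization, using Python's standard `unicodedata` module (`is_normalized` requires Python 3.8 or later). The helper name `originate` is hypothetical, and this is not the checking implementation referenced elsewhere in this paper.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# Code-point-by-code-point matching, as in the XML definition of
# "match", treats the two representations as different strings.
print(precomposed == decomposed)  # False

def originate(text: str) -> str:
    """Hypothetical originator-side step: normalize once, at the source."""
    return unicodedata.normalize("NFC", text)

# After early normalization, a plain comparison is sufficient.
print(originate(decomposed) == originate(precomposed))  # True

# A receiver can check for NFC instead of normalizing again.
print(unicodedata.is_normalized("NFC", decomposed))   # False
print(unicodedata.is_normalized("NFC", precomposed))  # True
```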
Currently, the discussion is focusing mainly on the degree and the means by which NFC should be checked or enforced on the Web. See [HowNormalization] for a discussion with colorful but somewhat farfetched examples. Implementations both of normalization and of normalization checking are becoming available. As an example, a compact implementation of NFC checking is available at [NFCcheck], and is integrated in Richard Tobin's RXP ([RXP]) to support the text normalization provisions of XML 1.1.
XML 1.0 also had and still has a pioneering role in the internationalization of identifiers. By identifiers, we mean short strings that are used both as protocol elements for automated operations and by humans. XML allows the use of non-ASCII element and attribute names, a feature that is not that important in production systems, but all the more so in education. It also introduced for the first time what is now known as the Internationalized Resource Identifier (IRI). IRIs are gaining attention in the context of Internationalized Domain Names (IDN).
There is no doubt about the need to be able to use the languages and scripts of the world for document and data content. When it comes to identifiers, however, the situation is less clear. On the one hand, limiting identifiers to characters from a very small and widely known set has the advantages that they can be handled even with older or outdated technology, as well as (with varying effort), by anybody around the globe. On the other hand, identifiers in a native language and script are easier to devise, to memorize, to guess, to understand, to manipulate, to correct, and to identify with for people familiar with that script and language (see [IRI2001]). This can for example be seen by the very widespread use of such identifiers to identify documents on office systems.
To understand what it means to use identifiers based on the Latin script for people who do not use the Latin script as their native script, imagine that you had to use Greek characters for element and attribute names as well as URIs and other identifiers. Even if most of us are familiar with the Greek alphabet from classics, physics, and other occasions, this would make it significantly harder to do everyday work.
XML allows the use of non-ASCII element and attribute names[2]. In theory, this was already possible in SGML, but neither the average SGML declaration nor the average SGML implementation actually provided it. In XML, it has been built in from the start.
While the author of this paper does not know of any major production DTD or schema using non-ASCII element or attribute names, this feature has proven to be invaluable for instruction about XML. Most introductory books about XML in Japanese, for example, use Japanese element and attribute names in their examples. This provides for a much more direct, immersive approach to understanding the idea of markup and the correspondence between abstract concepts and syntactical constructs. This is taken for granted by English readers, so again imagining using Greek characters will help to see the benefits.
Over the years, there has been some discussion about creating a mechanism to support parallel, translated DTDs/schemas directly. Currently, it is felt that user-agent-specific mechanisms or solutions based on transforms (using XSLT) should be good enough.
In the tradition of SGML, XML 1.0 defined strict rules for characters allowed in names. These rules were based on the character repertoire of Unicode 2.0. The addition of more scripts and characters in new versions of Unicode meant that such scripts (e.g. Khmer, Mongolian, Ethiopic, ...) could be used in content, but not in identifiers. This is one of the reasons for XML 1.1. The approach taken in XML 1.1 for name characters is more open-ended. Except for specific blocks, the parser no longer rejects unassigned codepoints. This introduces the (theoretical) risk of unassigned codepoints being used in element and attribute names, but avoids repeated updates of specification and implementations to cover newly encoded scripts.
XML 1.0 is also the first specification that officially uses what is now called IRIs, for system identifiers. An IRI is the internationalized equivalent of a URI. XLink (with its href attribute) and XML Schema (with its anyURI datatype) followed the lead of XML 1.0. The core of the IRI specification is the use of UTF-8 for conversion of non-ASCII characters to %HH-escapes in URIs. The two most recent changes to the IRI draft ([IRIdraft]) are disallowing spaces and other US-ASCII characters not allowed in URIs (they were at one point allowed in IRIs), and the use of punycode (rather than UTF-8) for conversion in domain names.
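The conversion described above can be sketched with Python's standard library. This is a simplified illustration and not the full algorithm of the IRI draft: the function name `iri_to_uri`, the `SAFE` character set, and the example IRI are assumptions, userinfo is ignored, and Python's built-in `idna` codec is used for the punycode step.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# ASCII characters left untouched; "%" is included so that escapes
# already present in the input are not escaped a second time.
SAFE = "/?:@!$&'()*+,;=%"

def iri_to_uri(iri: str) -> str:
    """Simplified sketch: punycode for the host, UTF-8 + %HH elsewhere."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    # Domain names use punycode ("xn--" labels), not %HH-escapes.
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    # quote() encodes non-ASCII characters as UTF-8, then as %HH.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE)
    fragment = quote(parts.fragment, safe=SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))

uri = iri_to_uri("http://bücher.example/straße?q=ä")
print(uri)  # http://xn--bcher-kva.example/stra%C3%9Fe?q=%C3%A4
```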
This section looks at various kinds of markup needed for internationalized DTDs/schemas, mostly for documents. Currently, different needs are covered in different specifications. There is no specification that lists these needs, let alone one that would allow the necessary elements and attributes to be integrated into a specific DTD/schema. It is not clear whether this should be done as part of a larger specification (e.g. as a module in XHTML 2.0), or as a separate specification or namespace.
Some of the needs for internationalization can to a certain extent be covered by some of the formatting characters of [Unicode4]. However, if markup (and styling) is available, it is preferable for many reasons to use this more explicit, flexible, and structured information. For details, see [UnicodeInXML].
A consequence of using markup for various internationalization needs is that in a good DTD/schema design, natural language running text should always be element content rather than attribute content, because attributes cannot contain markup. Another consequence is the desire to have better support on various levels for what we can call "text with markup". This means, among other things, an easy transition from strings without markup to strings with markup (rather than, for example, a high-level division between simple and complex types as in XML Schema), and better support for storing text with markup in databases.
Language tagging, in the form of the xml:lang attribute, is defined in XML 1.0. This is very helpful for text-to-speech applications, glyph disambiguation (in particular Han ideographs), and many other purposes. The main problem currently is that there is no support for this attribute or similar attributes that work by inheritance in technologies such as XSLT. This makes copying language information from a source document to an output document very tedious. We hope that this can be addressed in XSLT 2.0.
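To make the inheritance behavior concrete, here is a minimal sketch that computes the effective language of every element by walking the tree. The function name `effective_languages` and the sample document are assumptions; the namespace name is the one bound to the reserved `xml` prefix.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the namespace of the "xml" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_languages(element, inherited=None):
    """Yield (tag, language) pairs, applying xml:lang inheritance."""
    language = element.get(XML_LANG, inherited)
    yield element.tag, language
    for child in element:
        yield from effective_languages(child, language)

document = ET.fromstring(
    '<doc xml:lang="en"><p>Hello</p>'
    '<p xml:lang="ja">こんにちは<em>世界</em></p></doc>'
)
languages = list(effective_languages(document))
print(languages)
# [('doc', 'en'), ('p', 'en'), ('p', 'ja'), ('em', 'ja')]
```

A transform that copies only the `em` element would need this kind of lookup to carry `ja` along to the output, which is the tedious step the paragraph above refers to.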
Bidirectionality refers to the mixture of text displayed left-to-right and right-to-left when scripts such as Arabic or Hebrew are involved. The modern approach to bidirectionality, storing information in logical order, makes bidirectionality a problem of visual rendering only, and therefore makes it seem as if bidirectionality is strictly a styling problem. This is not true, because without certain structural information about bidirectional embeddings or overrides, it is impossible to display certain bidirectional text in a readable way.
HTML (see [HTML4] Section 8.2) is a good example of how to make these features part of the actual document markup. Properties have been added to CSS to be able to define default stylesheets for markup languages with such features. The bidirectionality-related properties in CSS are not intended to be used with a per-document or even finer granularity.
Ruby are short annotations or glosses mainly used in East Asian typography to indicate pronunciation of ideograms. Different from many other typographic phenomena found around the world, they require markup to indicate their structure. Because ruby didn't make it into HTML 4.0, it became a standalone Recommendation [RubyAnnotation] and part of XHTML 1.1, with the necessary styling support being worked on in the context of CSS3.
Unicode encodes an extremely large number of characters, but on occasion, new characters get created (e.g. symbols for a new mathematical theory, or an ideograph for the name of a new baby). Also, it is sometimes necessary to specify particular glyph shapes of certain characters as part of a document or data (rather than as simple styling), while Unicode only encodes characters but not glyphs. For XML, the best way to do this is again via markup. [CharGlyph] gives an overview of the issues and solutions adopted in different specifications. Some more coordination with the goal of a reusable solution seems desirable. A working group of the Text Encoding Initiative (TEI) is currently looking at this and related problems.
Often it is known or to some degree expected that documents will be translated and localized into various languages. In such cases, a DTD/schema will benefit from the general internationalization considerations in the previous section. There are a number of additional design considerations for document localization, for example information that makes it possible to specify that certain terms should not be translated. Richard Ishida gives a very good overview of this topic in [LocalizableDTD].
Localization for documents mainly means translation, which up to this point in time is mostly a very time-consuming process. Localization of data, however, can in many if not most cases be done automatically. In the context of world-wide connectivity, this raises various issues that are discussed in detail in [WWLocalization]. As much as possible, data and functionality should be kept independent of any particular local convention for representation and presentation, and conversion to such conventions should be pushed to the edges of the infrastructure. Also, because it is not clear on the Web who will look at data, it can be useful to build in some redundancies. For example, [March 2nd, 2004] is much clearer in all cases than [02/03/04] or some other combination of numbers.
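The ambiguity of the all-numeric form can be shown with a short sketch that reads the same string under three common conventions. The convention labels are illustrative.

```python
from datetime import datetime

ambiguous = "02/03/04"

# The same string yields three different days depending on the
# convention the reader assumes.
readings = {
    "day/month/year": datetime.strptime(ambiguous, "%d/%m/%y").date(),
    "month/day/year": datetime.strptime(ambiguous, "%m/%d/%y").date(),
    "year/month/day": datetime.strptime(ambiguous, "%y/%m/%d").date(),
}
for convention, value in readings.items():
    print(convention, "->", value.isoformat())
# day/month/year -> 2004-03-02
# month/day/year -> 2004-02-03
# year/month/day -> 2002-03-04
```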
It is easy to say that localization should be pushed to the edges (i.e. the clients in a client-server setup), but it may not always be possible. [WWLocalization] discusses cases where this would lead to too much data transfer (e.g. querying/sorting/collation), with possible privacy implications, or where the data or code for some localization operation may not be available locally. In particular for XML Query and Web services, this raises the question of how to identify and communicate localization preferences. Web technology (i.e. XML, URIs, Web services) not only raises new problems, but can provide new solutions. [LDML] is an XML-based format for localization preferences. A proposal for collation identifiers is available at [ComparatorRegistry].
The need for locale-independent data interchange mainly affects XML Schema. Although XML Schema datatypes use textual representations somewhat biased towards English or Western conventions, most datatypes have value spaces that are largely culturally neutral.[3] Culturally neutral datatypes do not allow the identification of abstract datatypes by a schema. These are cases where the type of a datum is known, but the lexical representation may be a free-form representation not directly accessible to automatic processing, and the actual value may not be known precisely. The best solution currently is to use a typed attribute on free-form element content, for example:
<date value='YYYY-MM-DD'>free-form date</date>
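A consumer of this pattern might look as follows. This is only a sketch: the concrete date and its free-form wording are illustrative values filled into the template above, and `date.fromisoformat` requires Python 3.7 or later.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Illustrative instance of the template: a culturally neutral value in
# the attribute, and a free-form representation as element content.
element = ET.fromstring("<date value='2004-03-02'>March 2nd, 2004</date>")

# Automatic processing relies on the typed attribute only ...
value = date.fromisoformat(element.get("value"))
print(value.year, value.month, value.day)  # 2004 3 2

# ... while the element content is passed through for human readers.
display_text = element.text
print(display_text)  # March 2nd, 2004
```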
In the long term, the relationship between abstract datatypes (e.g. some not necessarily specified day in time), concrete values (a particular day in time), lexical representations (culturally neutral or not), and possible operations on a datatype (i.e. a datatype as an object in the sense of object-oriented programming) may need some further careful studies.
XSLT 1.0 started to provide functionality for conversion to localized representations (e.g. for numbers), but a lot remains to be done. Ideally, technology will be designed to allow the addition of localization functionality by third parties; this is the easiest way to help get languages and conventions covered that may not be of primary commercial interest.
[Activity] W3C Internationalization Activity, Web page at http://www.w3.org/International/.
[AuthoringTech] Richard Ishida, Authoring Techniques for XHTML & HTML Internationalization 1.0, W3C Working Draft 9 October 2003, available at http://www.w3.org/TR/i18n-html-tech.
[CharGlyph] Missing Characters and Glyphs, Web page at http://www.w3.org/International/O-MissCharGlyph.
[CharacterModel] Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, and Tex Texin, Character Model for the World Wide Web 1.0, W3C Working Draft 22 August 2003, available at http://www.w3.org/TR/charmod/.
[CharReq] Martin J. Dürst, Requirements for String Identity Matching and String Indexing, W3C Working Draft 10-July-1998, available at http://www.w3.org/TR/WD-charreq.
[ComparatorRegistry] Chris Newman, Internet Application Protocol Comparator Registry, Internet Draft draft-newman-i18n-comparator-00.txt (work in progress), available at http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-00.txt.
[NFCcheck] Martin J. Dürst, Efficient implementation of Normalization checking for XML 1.1, available from http://www.w3.org/2003/06/xml1.1test/.
[HowNormalization] Cliff Schmidt, How Normalization Standards Are Helping and Hindering the Success of XML: Data Interchange on the Web and the W3C Character Model Specification, Proc. XML 2002, December 2002, Baltimore, MD, U.S.A., available at http://www.idealliance.org/papers/xml02/dx_xml02/papers/06-01-03/06-01-03.html.
[HTML4] Dave Raggett, Arnaud Le Hors, and Ian Jacobs, HTML 4.01 Specification, W3C Recommendation 24 December 1999, available at http://www.w3.org/TR/html4.
[IRI2001] Martin J. Dürst, Internationalized Resource Identifiers: From Specification to Testing, Proc. 19th International Unicode Conference, September 2001, San Jose, CA, U.S.A., available at http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.
[IRIdraft] Martin J. Dürst and Michel Suignard, Internationalized Resource Identifiers, Internet Draft draft-duerst-iri-04.txt (work in progress), available from http://www.w3.org/International/iri-edit/.
[LDML] Free Standards Group Open Internationalization Initiative, Locale Data Markup Language Specification 1.0, June 2004, available at http://www.openi18n.org/specs/ldml/.
[LocalizableDTD] Richard Ishida, Localizable DTD Design, Multilingual Computing, Volume 13 Issue 5, available at http://www.multilingual.com/ishida49.htm.
[mod_fileiri] Martin J. Dürst, FileIRI Module, available at http://cvs.w3.org/Team/apache-modules/mod_fileiri/?cvsroot=Public.
[RubyAnnotation] Marcin Sawicki, Michel Suignard, Masayasu Ishikawa, Martin Dürst, and Tex Texin, Ruby Annotation, W3C Recommendation 31 May 2001, available at http://www.w3.org/TR/ruby.
[RXP] Richard Tobin, RXP - an XML parser available under the GPL, available at http://www.cogsci.ed.ac.uk/~richard/rxp.html.
[UAX#15] Mark Davis and Martin Dürst, Unicode Normalization Forms, Unicode Standard Annex #15, available at http://www.unicode.org/reports/tr15.
[Unicode4] The Unicode Consortium, The Unicode Standard Version 4.0, Addison-Wesley, Reading, MA, U.S.A., 2003, available at http://www.unicode.org/versions/Unicode4.0.0/bookmarks.html.
[UnicodeInXML] Martin Dürst and Asmus Freytag, Unicode in XML and other Markup Languages, Unicode Technical Report #20, W3C Note, June 2003, available at http://www.unicode.org/reports/tr20/ or http://www.w3.org/TR/unicode-xml/.
[WSUsage] Kentaroh Noji, Martin J. Dürst, Addison Phillips, Takao Suzuki, and Tex Texin, Web Services Internationalization Usage Scenarios, W3C Working Draft 16 May 2003, available at http://www.w3.org/TR/ws-i18n-scenarios.
[WWLocalization] Martin J. Dürst, World Wide Localization, Proc. 23rd Internationalization and Unicode Conference, March 2003, Prague, Czech Republic, available at http://www.w3.org/2003/Talks/0324WWL/paper.html.
[1] Unfortunately, element and attribute names as well as PIs and comments do not allow NCRs, which means that conversion from a legacy encoding to a Unicode-based encoding is always possible, but conversion in the other direction may fail.
[2] More exactly, all Names and Nmtokens in the XML syntax.
[3] The exceptions are types prefixed with a "g", such as gDay and gMonth, where the value space is closely tied to the Gregorian calendar.